[SERVER-22389] Replica set member with data from backup fails to sync Created: 01/Feb/16  Updated: 09/Jun/16  Resolved: 04/Feb/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Yoni Douek Assignee: Scott Hernandez (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File conf     HTML File log     Zip Archive log.zip     Zip Archive m.zip     File metrics.2016-02-02T15-07-47Z-00000     File mongod-bsync-logging.gz     Text File mongod.log    
Issue Links:
Duplicate
duplicates SERVER-22456 The oplog find query timeout is too low Closed
Operating System: ALL
Participants:

 Description   

Our replica set contains a primary, a secondary and an arbiter. Data is backed up periodically from the secondary using the recommended method (fsync lock & snapshotting). When we create a new member from the backup and add it to the set, it will not sync; it shows the following endlessly without making any progress:

2016-02-01T05:33:38.739+0000 I REPL     [ReplicationExecutor] syncing from: in.db2m2.mydomain.com:27017
2016-02-01T05:33:43.812+0000 I REPL     [ReplicationExecutor] syncing from: in.db2m2.mydomain.com:27017
2016-02-01T05:33:48.885+0000 I REPL     [ReplicationExecutor] syncing from: in.db2m2.mydomain.com:27017

This happens even for very recent backups (a few minutes old). Note that the data directory is identical to that of our secondary node, which does sync correctly even after being stopped and restarted.
This worked for us in 3.0.

Happens with protocol version 0 and 1.

A log with verbose level 5 is attached. We can easily reproduce it.

Please advise; right now the only way for us to add a member is a full initial sync, which is unrealistic (our DB size is 8TB).
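For reference, the backup sequence on the secondary is essentially the standard fsync-lock-plus-snapshot pattern (a sketch; the snapshot step itself happens outside mongod, via EBS in our case):

// On the secondary being backed up:
db.fsyncLock()      // flush data files to disk and block writes
// take the EBS/filesystem snapshot of the dbpath here
db.fsyncUnlock()    // resume writes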



 Comments   
Comment by Ramon Fernandez Marina [ 13/Feb/16 ]

yonido, yesterday we released a 3.2.2-rc2 release candidate containing a fix for SERVER-22456, which addresses this issue. If you're interested in trying it out you can download it here – and if you find any issues please feel free to open new tickets as needed.

Thanks,
Ramón.

Comment by Yoni Douek [ 10/Feb/16 ]

Wanted to stress how severe this bug is:
Our primary was under load on pv0, so reads from its oplog took more than 2 seconds; the secondary then started lagging and is now stuck in a state from which it can never sync, because any read has to scan a lot of oplog data.

So, in pv0: high load = replica out of sync, forever.
This is the 4th bug this week. Are you guys doing ANY testing on sharded replicas? 2 seconds seems like a very arbitrary decision. Just like setting it to 2 minutes in the "fix".

Comment by Scott Hernandez (Inactive) [ 04/Feb/16 ]

We have now committed the change to the timeout for 3.2.2 and 3.3.2.

The old timeout in 3.0 was 30 seconds.

Comment by Yoni Douek [ 03/Feb/16 ]

What was the previous value? You can consider making it dynamic per query, based on the time distance between the optimes.

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

That is good to hear. We are thinking of going with 1 minute.

Comment by Yoni Douek [ 03/Feb/16 ]

Thanks. We tried an election timeout of 60 seconds (= 30-second query timeout) with no luck, but 120 seconds (= 60-second query timeout) worked : )
That's nice. Thanks.

It always worked on 3.0, so I wanted to double-check that the previous setting was indeed 30 seconds.

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

I've filed a new issue to address the code changes we are going to make to address this: SERVER-22456

The timeout for a pv1 oplog find query is the election timeout / 2, so 5 seconds, whereas the timeout for pv0 is 2 seconds.

When replication issues the query it uses a few special options to speed it up, so your manual query doesn't reflect the performance replication will see. Two special options are used: AwaitData and OplogReplay.
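To approximate what replication actually runs, you can issue the find command directly (a sketch that mirrors the command visible in the attached log; the timestamp is the example value from the manual query below):

db.getSiblingDB("local").runCommand({
    find: "oplog.rs",
    filter: { ts: { $gte: Timestamp(1454528396, 636) } },
    tailable: true,
    oplogReplay: true,
    awaitData: true,
    maxTimeMS: 5000
})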

Comment by Yoni Douek [ 03/Feb/16 ]

I've tried to run the query manually:

db.oplog.rs.find({ ts: { $gte: Timestamp(1454528396,636) } })

It has been running for more than 5 minutes now, so I'm not sure the switch from 30 to 5 seconds is the only thing going on here...

There's also no index on "ts" in the oplog.

Comment by Yoni Douek [ 03/Feb/16 ]

How is election timeout related?

Comment by Yoni Douek [ 03/Feb/16 ]

OK fair enough, we'll try to create a very tight backup for now.

We'd appreciate a fix in the next release.

Thanks for the super fast responsiveness!

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

We will be looking at solutions.

There are two workarounds:

  • In protocolVersion 1 you can set the election timeout very high (e.g. 60 seconds); see the sketch below.
  • Sync from a WiredTiger node, if you have one.
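A minimal sketch of the first workaround, assuming the election timeout is controlled by the settings.electionTimeoutMillis field of the replica set config (value in milliseconds):

// Run against the primary; raises the pv1 election timeout to 60 seconds,
// which also raises the derived oplog query timeout.
cfg = rs.conf()
if (!cfg.settings) { cfg.settings = {} }
cfg.settings.electionTimeoutMillis = 60000
rs.reconfig(cfg)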

If I discover more as I look at the code I'll post it here.

Comment by Yoni Douek [ 03/Feb/16 ]

Great, is there anything we can do with the existing build? Change the maxTimeMS for this query somehow?

Not sure we'll be able to get the backup fresh enough that the query returns within the timeout.

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

Yep, looks that way. In 3.2 we decreased the timeout for oplog queries from 30 seconds to 5 seconds; the lower timeout is needed for liveness information and additional features, and that is the underlying issue.

Thanks for the help and logs.

I will see what we can do about this, but I think at the moment the problem is limited to MMAPv1 sync sources (where the query needs to seek into the middle of the oplog); WiredTiger has an internal index, so this won't be a problem there.

Comment by Yoni Douek [ 03/Feb/16 ]

I think this is the source of the problem:

command local.oplog.rs command: find { find: "oplog.rs", filter: { ts: { $gte: Timestamp 1453367016000|1072 } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 5000, term: 13 } keyUpdates:0 writeConflicts:0 exception: operation exceeded time limit code:50 numYields:79309 reslen:126 locks:{ Global: { acquireCount: { r: 158620 } }, MMAPV1Journal: { acquireCount: { r: 79310 } }, Database: { acquireCount: { r: 79310 } }, oplog: { acquireCount: { R: 79310 } } } protocol:op_command 5066ms

Looks like it takes more than 5 seconds to read the oplog on the source node. Maybe this happens when the member is too far behind and has a lot to read (it has a few hours to catch up in our case), and therefore happens only with backups that are far behind. This didn't happen in 3.0 though...

Diagnostic data from the source node is attached; let me know if you need anything else.

Thanks!

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

Okay, that was somewhat useful and points to the upstream server (the sync source of in.db2m1.roi.test.mydomain.com:27017) as not being able to return any data within the timeout.

2016-02-03T17:56:16.764+0000 D REPL     [rsBackgroundSync-0] Error returned while fetching the oplog: ExceededTimeLimit: operation exceeded time limit

Can you post a section of the logs from that server? If the problem is still happening any section should do, as this is probably happening constantly (every few seconds). Also, can you provide the diagnostic.data directory?

Thanks,
Scott

Comment by Yoni Douek [ 03/Feb/16 ]

Attached: a log from the version you provided (thanks!) with log level 2.
The node was added right after being started and is stuck in the "syncing" loop.

Thanks,

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

Yes, some of this is off-topic and I don't want to get sidetracked, but here are some answers:

A member will not be chosen as a sync source if there are no new writes, so it is expected for a node starting up to behave that way until it sees new writes. I'm not sure whether you have constant writes during/after the backup/restore/new-member sequence. Changing the protocolVersion (via a reconfig) causes a write and, as a consequence, new sync sources may be evaluated and chosen.

The term will only exist once the primary switches to protocolVersion 1, and each member will only record documents in "replset.election" once it has switched to electionProtocol 1 and been involved in an election. So it sounds very much like this is both expected and by design – not a bug.

Feel free to file more issues if you think you have found a bug, but for now let's try to diagnose your sync source problems. So far everything you have supplied seems to point to some issue in your environment related to that.

Comment by Yoni Douek [ 03/Feb/16 ]

Another unrelated question, and sorry for being off topic: we also had a small glitch here that caused the wrong member to be elected. I may open a different bug. From reading the logs it looks like it is related to a wrong replication 'term' number on some of the members.

Are you aware that when you upgrade a replica set that was created on 3.0 and update its protocolVersion, the replset.election collection is not created and does not exist (and therefore term data doesn't exist), unlike a replica set initiated on 3.2+, where it does exist? Is this by design?

Comment by Yoni Douek [ 03/Feb/16 ]

I'm on it right now, will update soon. I did find another bug though, maybe it's related?

1. Replica set with no protocolVersion set (not updated yet), running 3.2.1.
2. Add a node running 3.2.1 from a backup that has no protocolVersion as well.
3. The node will start but will say it has "no member to sync from" and stay that way.
4. Only after updating the protocolVersion will it sync.

This practically means that you can't add nodes running 3.2 without updating the protocolVersion.
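For reference, the protocolVersion update in step 4 amounts to the standard reconfig (a sketch, run against the primary):

cfg = rs.conf()
cfg.protocolVersion = 1
rs.reconfig(cfg)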

Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ]

yonido, were you able to run the version I supplied? If so, what happened and can you provide the logs?

Comment by Scott Hernandez (Inactive) [ 02/Feb/16 ]

I was not able to reproduce what you are seeing, either part.

Can you post the logs from the initial sync member?

I have put together a new mongod (64-bit Linux) with extra logging around some areas at logLevel 2, which should provide more information; can you run it once the node gets "stuck"? This build should not stay in use once we have the extra logging or if the issue goes away, so once we have that info please shut the process down and go back to the real mongod binary.

The filename attached is mongod-bsync-logging.gz

Also, can you try to add the member to the replica set before you take the snapshot and start the node? This will ensure that the replica set config already contains that member when it first comes online; just an idea which might show more info in the logs.
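If it helps, that ordering is simply (a sketch; the hostname is a placeholder for your new member):

// On the primary, before taking the snapshot and starting the new node:
rs.add("new-member.mydomain.com:27017")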

Comment by Yoni Douek [ 02/Feb/16 ]

Hey,

Very bad news: this also happens when we add a completely empty node and it finishes the initial sync. When it starts syncing from that point, it behaves in a similar way.

So we're basically stuck. Please help.

Comment by Yoni Douek [ 01/Feb/16 ]

1. Secondary.

2. No.

3+4. Attached (it was added/removed several times in that timeframe; let me know if you need anything cleaner).

5. Disk-level snapshot, using AWS EBS snapshots, so everything is copied (I know it's weird...).

Our hunch is that you added some identifier which is saved externally? Another hint: in the attached log entry, the "optimes" array does not include the primary node, whereas on secondaries that do sync correctly, it does.
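In case it helps to compare, and assuming the array in question corresponds to the per-member optimes reported by replSetGetStatus, this is how we pull them out (a sketch):

rs.status().members.map(function (m) {
    return { name: m.name, state: m.stateStr, optime: m.optimeDate };
})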

Comment by Scott Hernandez (Inactive) [ 01/Feb/16 ]

Understood, I will try my own reproduction. We already have tests which check this functionality (fsync+lock and copying the files to add a new node from), so I'll try to vary things which might be different on your system.

Unless I can easily reproduce this I expect I will have a few more questions soon, but let me start with a few basic ones, if you have a minute:

  1. The node you fsync+lock, is it the primary or a secondary?
  2. Is there an election or step-down immediately before the backup is taken?
  3. Can you provide the logs from before the shutdown (I assume you copy the logs as well, maybe) and after the restart (which you provided a section of)?
  4. Can you provide the diagnostic.data directory contents on the new node?
  5. How do you do your backup and what files are included/excluded?

Thanks in advance,
Scott

Comment by Yoni Douek [ 01/Feb/16 ]

Hey Scott,

We don't have a test to share, but we do have a test environment with 3 shards, and this happens on each one every time, so we can reproduce it easily if you need anything from us (logs, running a special version). We add members quite often (every ~2 weeks) to replace existing ones or for maintenance, and we used this method more than 50 times on 3.0 with no issues, so this is new.

Comment by Scott Hernandez (Inactive) [ 01/Feb/16 ]

Hello Yoni,

Do you have a test that reproduces this, or do you mean that when you take a backup and add a new node with that data it happens frequently? And if the latter, how often do you add a new member this way?

--Scott

Comment by Yoni Douek [ 01/Feb/16 ]

Also adding our rs.conf(). The new member is the last one.
