[SERVER-32542] MongoDB member repeatedly ends up in unrecoverable rollback state Created: 04/Jan/18  Updated: 27/Oct/23  Resolved: 26/Jan/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Max Bennedich Assignee: Judah Schvimer
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

We have a small three-member replica set running MongoDB 3.4.9:

  • DB-1. Physical local server, Windows Server 2012, 64 GB RAM, 6 cores. Hosted in Scandinavia.
  • DB-2. Amazon EC2, Windows Server 2016, r4.2xlarge, 61 GB RAM, 8 vCPUs. Hosted in Germany.
  • Arbiter. Tiny cloud based Linux instance.

DB-1 runs as the primary member and has ended up in a ROLLBACK state roughly once a month ever since we set the replica set up. A few times the rollback has succeeded, but usually it is unrecoverable and we have to manually resync the data. We have now had 3 unrecoverable rollbacks in 8 days, and we have reached the point where we need to address this.

Below are the logs from 2 of our last 3 rollbacks (grepped for REPL and a few other relevant keywords). During the Dec 28 rollback, we shut DB-1 down manually after 1.5 hours because the rollback didn't progress. During the Jan 4 rollback, the DB terminated itself. The replication lag at the time of the Dec 28 incident may have been as high as ~180 seconds, while it was ~3 seconds during the Jan 4 incident.

The rollback/ folder on DB-1 contains 436 files totalling 55 MB, but none of them are from the recent outages (they are all older than 2 months).

Any idea why we'd end up in ROLLBACK to begin with, and perhaps more importantly, why the rollback almost never succeeds?

Note: Timestamps between servers are in sync within ±1 second.
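For reference, a quick way to check member states and replication lag from the mongo shell (standard 3.4 helpers; nothing here is specific to our setup):

// Print each member's name, state, and last applied optime as this node sees it
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + tojson(m.optime));
});

// Lag of each secondary relative to the primary (3.4 helper name)
rs.printSlaveReplicationInfo();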

December 28 incident:

DB-1:

2017-12-28T11:07:29.988 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
2017-12-28T11:07:29.989 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 130
2017-12-28T11:07:29.989 I REPL     [replExecDBWorker-0] transition to SECONDARY
2017-12-28T11:07:31.251 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2017-12-28T11:07:37.232 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
2017-12-28T11:07:37.628 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: { ts: Timestamp 1514459257000|3547, t: 129 }. source's GTE: { ts: Timestamp 1514459258000|1, t: 130 } hashes: (-2389994255667895449/3792703541455281876)
2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] rollback 0
2017-12-28T11:07:37.629 I REPL     [rsBackgroundSync] transition to ROLLBACK
2017-12-28T11:07:37.640 I REPL     [rsBackgroundSync] beginning rollback
2017-12-28T11:07:37.658 I REPL     [rsBackgroundSync] rollback 1
2017-12-28T11:07:38.308 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2017-12-28T11:07:38.341 I REPL     [rsBackgroundSync] rollback our last optime:   Dec 28 11:07:37:ddb
2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback their last optime: Dec 28 11:07:38:94
2017-12-28T11:07:38.342 I REPL     [rsBackgroundSync] rollback diff in end of log times: -1 seconds
2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1514459067000|8080, t: 129 }
2017-12-28T11:07:56.813 I REPL     [rsBackgroundSync] rollback 3 fixup
2017-12-28T12:38:24.833 I COMMAND  [conn693343] terminating, shutdown command received
2017-12-28T12:38:24.834 I REPL     [conn693343] shutting down replication subsystems
2017-12-28T12:38:24.834 I REPL     [conn693343] Stopping replication reporter thread
2017-12-28T12:38:24.834 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb2.xxxx.com:27017: CallbackCanceled: Reporter no longer valid
2017-12-28T12:38:24.835 I REPL     [conn693343] Stopping replication fetcher thread
2017-12-28T12:38:24.858 I REPL     [conn693343] Stopping replication applier thread

DB-2:

2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2017-12-28T11:07:29.584 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] VoteRequester(term 129 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 129, voteGranted: true, reason: "", ok: 1.0 }
2017-12-28T11:07:29.771 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] VoteRequester(term 130) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 130, voteGranted: true, reason: "", ok: 1.0 }
2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 130
2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] transition to PRIMARY
2017-12-28T11:07:29.803 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
2017-12-28T11:07:29.884 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1514459257000|1022, t: 129 }
2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
2017-12-28T11:07:31.804 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
2017-12-28T11:07:31.924 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2017-12-28T11:07:32.632 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2017-12-28T11:07:32.831 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
2017-12-28T11:07:33.373 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
2017-12-28T11:07:38.329 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:05:11.681 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 443437739 -- target:proddb1.xxxx.com:27017 db:admin expDate:2017-12-28T12:05:11.680 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb2.xxxx.com:27017", fromId: 3, term: 130 }
2017-12-28T12:05:13.930 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:46:23.193 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; HostUnreachable: An existing connection was forcibly closed by the remote host.

Arbiter:

2017-12-28T11:07:28.769 I COMMAND  [conn173] command local.replset.election command: replSetRequestVotes { replSetRequestVotes: 1, setName: "rs", dryRun: true, term: 129, candidateIndex: 2, configVersion: 66175, lastCommittedOp: { ts: Timestamp 1514459067000|8080, t: 129 } } numYields:0 reslen:63 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 } } } protocol:op_command 156ms
2017-12-28T11:07:30.871 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2017-12-28T11:07:33.358 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2017-12-28T11:07:38.403 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2017-12-28T12:46:22.078 I -        [conn165] end connection 158.174.52.213:50060 (3 connections now open)
2017-12-28T12:46:24.473 I ASIO     [ReplicationExecutor] dropping unhealthy pooled connection to proddb1.xxxx.com:27017
2017-12-28T12:46:24.474 I ASIO     [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2017-12-28T12:46:24.491 I ASIO     [ReplicationExecutor] Failed to close stream: Transport endpoint is not connected
2017-12-28T12:46:24.491 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to proddb1.xxxx.com:27017
2017-12-28T12:46:34.473 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit

January 4 incident:

DB-1:

2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
2018-01-04T11:47:15.969 I REPL     [ReplicationExecutor] stepping down from primary, because a new term has begun: 136
2018-01-04T11:47:15.969 I REPL     [replExecDBWorker-2] transition to SECONDARY
2018-01-04T11:47:16.551 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2018-01-04T11:47:22.565 I REPL     [rsBackgroundSync] sync source candidate: proddb2.xxxx.com:27017
2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Received an empty batch from sync source.
2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] rollback 0
2018-01-04T11:47:22.863 I REPL     [rsBackgroundSync] transition to ROLLBACK
2018-01-04T11:47:22.873 I REPL     [rsBackgroundSync] beginning rollback
2018-01-04T11:47:22.892 I REPL     [rsBackgroundSync] rollback 1
2018-01-04T11:47:23.102 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback our last optime:   Jan  4 11:47:23:516
2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback their last optime: Jan  4 11:47:23:53
2018-01-04T11:47:23.135 I REPL     [rsBackgroundSync] rollback diff in end of log times: 0 seconds
2018-01-04T11:47:42.169 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb-arbiter.xxxx.com:27017; ExceededTimeLimit: Operation timed out, request was RemoteCommand 1522187 -- target:proddb-arbiter.xxxx.com:27017 db:admin expDate:2018-01-04T11:47:50.168 cmd:{ replSetHeartbeat: "rs", configVersion: 66175, from: "proddb1.xxxx.com:27017", fromId: 2, term: 136 }
2018-01-04T11:47:44.259 I REPL     [ReplicationExecutor] Member proddb-arbiter.xxxx.com:27017 is now in state ARBITER
2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback common point is { ts: Timestamp 1515066330000|1306, t: 135 }
2018-01-04T11:47:45.785 I REPL     [rsBackgroundSync] rollback 3 fixup
2018-01-04T11:59:36.995 I REPL     [rsBackgroundSync] rollback finished
2018-01-04T11:59:36.995 F REPL     [rsBackgroundSync] Unable to complete rollback. A full resync may be needed: UnrecoverableRollbackError: replSet too much data to roll back @ 18753
2018-01-04T11:59:36.996 I -        [rsBackgroundSync] Fatal Assertion 28723 at src\mongo\db\repl\rs_rollback.cpp 915

DB-2:

2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] Starting an election, since we've seen no PRIMARY in the past 10000ms
2018-01-04T11:47:15.179 I REPL     [ReplicationExecutor] conducting a dry run election to see if we could be elected
2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] VoteRequester(term 135 dry run) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 135, voteGranted: true, reason: "", ok: 1.0 }
2018-01-04T11:47:15.208 I REPL     [ReplicationExecutor] dry election run succeeded, running for election
2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] VoteRequester(term 136) received a yes vote from proddb-arbiter.xxxx.com:27017; response message: { term: 136, voteGranted: true, reason: "", ok: 1.0 }
2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] election succeeded, assuming primary role in term 136
2018-01-04T11:47:15.239 I REPL     [ReplicationExecutor] transition to PRIMARY
2018-01-04T11:47:15.240 I REPL     [ReplicationExecutor] Entering primary catch-up mode.
2018-01-04T11:47:15.423 I REPL     [ReplicationExecutor] Heartbeats updated catchup target optime to { ts: Timestamp 1515066442000|1854, t: 135 }
2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Catchup timed out after becoming primary.
2018-01-04T11:47:17.240 I REPL     [ReplicationExecutor] Exited primary catch-up mode.
2018-01-04T11:47:17.241 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.
2018-01-04T11:47:17.455 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2018-01-04T11:47:18.078 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2018-01-04T11:47:18.916 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to proddb1.xxxx.com:27017: InvalidSyncSource: Sync source was cleared. Was proddb1.xxxx.com:27017
2018-01-04T11:47:23.557 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2018-01-04T11:59:49.050 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit

Arbiter:

2018-01-04T11:47:15.993 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state SECONDARY
2018-01-04T11:47:18.037 I REPL     [ReplicationExecutor] Member proddb2.xxxx.com:27017 is now in state PRIMARY
2018-01-04T11:47:25.998 I REPL     [ReplicationExecutor] Member proddb1.xxxx.com:27017 is now in state ROLLBACK
2018-01-04T11:59:47.341 I REPL     [ReplicationExecutor] Error in heartbeat request to proddb1.xxxx.com:27017; ExceededTimeLimit: Couldn't get a connection within the time limit



 Comments   
Comment by Max Bennedich [ 26/Jan/18 ]

Ok, thanks for looking into this. We'll work on avoiding network instability then, so that we don't end up in rollback mode to begin with, and we'll upgrade to 3.6 when we have a chance.

Comment by Judah Schvimer [ 24/Jan/18 ]

The rollbacks are probably slow due to a slow network, since we have to fetch a lot of data across the network from the sync source (more than 300 MB, if we trust the server). We don't actually write the rollback files until after the point where "too much data" would have been hit, so it makes sense that those are empty. On 3.4.9 I cannot see exactly what got fetched, but on 3.6 we can. If you were to upgrade a binary to 3.6 and the issue persisted, you could see what occurred by running the following before you hit a rollback:

db.setLogLevel(2, "replication.rollback")
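An equivalent, if setParameter is more convenient (the replication.rollback component path is assumed to match the 3.6 docs):

db.adminCommand({
    setParameter: 1,
    // Raise only the rollback sub-component of replication logging to level 2
    logComponentVerbosity: { replication: { rollback: { verbosity: 2 } } }
})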

That being said, you may not be interested in upgrading. In that case, I have no data indicating that MongoDB is behaving incorrectly. We did a significant amount of work improving rollback in 3.6 and plan to continue improving it in the future.

Comment by Max Bennedich [ 24/Jan/18 ]

Hey Judah,

Thanks for your response! I think you are correct that the rollbacks are initiated by bad network connectivity. We have reached a similar conclusion here, and that answers the first of our questions.

However, the remaining question is whether the rollbacks themselves work as intended. Why are the rollbacks so slow (see logs above)? Why was nothing written to the rollback folder? The last content we have in the rollback folder is from November 2017. We've had several rollbacks since then, a few of them failing with "replSet too much data to roll back" and others failing with different error messages.

During our last rollback, we determined that we had a network loss of at most 20 seconds, and our logging shows that we had no replication lag at the time. I don't think a divergence of ~20 seconds by itself could amount to more than 300 MB of data, since we don't have that amount of traffic.
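For what it's worth, here is roughly how the oplog volume over such a window could be estimated on the other member (the timestamps below are placeholders, not values from the incident):

// Sum the BSON size of all oplog entries in a ~20 second window (placeholder timestamps)
var start = Timestamp(1515066425, 0), end = Timestamp(1515066445, 0);
var total = 0;
db.getSiblingDB("local").oplog.rs
    .find({ ts: { $gte: start, $lt: end } })
    .forEach(function (doc) { total += Object.bsonsize(doc); });
print(total + " bytes of oplog in that window");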

Max

Comment by Judah Schvimer [ 24/Jan/18 ]

Hi Max,

Thank you for filing this issue and sorry it took so long to respond. I think the rollbacks occur because the secondary stops seeing the primary and runs an election. Since you have an arbiter, the secondary will always win that election. This is likely due to poor network connectivity, which is supported by the many connection failures and heartbeat timeouts I see in the logs. Eventually the old primary steps down because it sees the new term, but by that point it may have accepted many operations that it will need to roll back. If there are enough operations, the rollback fails with the "too much data to roll back" error that we see: https://github.com/mongodb/mongo/blob/r3.4.9/src/mongo/db/repl/rs_rollback.cpp#L400-L402.
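For context, a quick way to see how far each data-bearing member's oplog extends (a standard shell query against the local database, shown only for illustration):

// Newest oplog entry on this member; running this on both data-bearing members and
// comparing the ts/t fields shows how far the old primary diverged before stepping down
db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1).pretty()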

To stop seeing rollbacks, I'd investigate why there are so many network failures. MongoDB appears to be functioning as designed. Does this address your concerns?

Best,
Judah
