[SERVER-31794] REPL oplog corrupts SECONDARY members. Created: 02/Nov/17  Updated: 25/Nov/18  Resolved: 25/Oct/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Andy Wallis Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

In a simple two-member replica set with a single arbiter:
While a temp collection is being written to on the PRIMARY, force a hard shutdown of the SECONDARY.
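
A minimal load-generation sketch of these steps, assuming pymongo and a replica set laid out as in the logs below (server2.site:33000 as PRIMARY, server1.site:33000 as SECONDARY, arbiter.site:33000 as ARBITER); the set name "rs0" and the "tempdb.temp_load" namespace are placeholders, not taken from the ticket:

    # Run against the replica set; while the loop is inserting, kill -9 the
    # SECONDARY's mongod, then restart it and watch it recover and roll back.
    import time
    from pymongo import MongoClient

    client = MongoClient(
        "mongodb://server1.site:33000,server2.site:33000/?replicaSet=rs0",
        w=1)  # acknowledge on the PRIMARY only, as in the reported workload
    coll = client["tempdb"]["temp_load"]

    seq = 0
    while True:
        coll.insert_one({"seq": seq, "ts": time.time()})
        seq += 1
        time.sleep(0.01)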

Participants:

 Description   

After a hard stop, the SECONDARY recovered, performed a rollback, and began replicating the PRIMARY's oplog. While applying the oplog it hit a writer worker exception that caused a Fatal Assertion. Even after wiping the secondary, replication hit the same error continually. The only solution was to nuke the entire replica set and arbiter from orbit and rebuild.
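
A rough way to observe the crash loop described above, assuming pymongo (host and port match the logs below; the polling interval is arbitrary): poll replSetGetStatus on the PRIMARY and watch the recovering member cycle through ROLLBACK/RECOVERING, then go unhealthy each time the repl writer worker hits the fatal assertion.

    import time
    from pymongo import MongoClient

    primary = MongoClient("server2.site", 33000)

    while True:
        status = primary.admin.command("replSetGetStatus")
        for member in status["members"]:
            health = "healthy" if member.get("health", 1) == 1 else "DOWN"
            print(member["name"], member["stateStr"], health)
        print("---")
        time.sleep(5)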

PRIMARY:
Metadata: { acquireCount: { w: 1 } }, oplog: { acquireCount: { W: 1 } } } protocol:op_query 155ms
2017-03-25T16:29:36.208+0000 I NETWORK [conn64] end connection 10.196.36.193:20019 (49 connections now open)
2017-03-25T16:29:36.209+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.193:20029 #69 (50 connections now open)
2017-03-25T16:29:36.211+0000 I NETWORK [conn69] end connection 10.196.36.193:20029 (49 connections now open)
2017-03-25T16:29:36.407+0000 I NETWORK [conn55] end connection 10.196.36.193:20003 (48 connections now open)
2017-03-25T16:29:36.408+0000 I NETWORK [conn62] end connection 10.196.36.193:20015 (48 connections now open)
2017-03-25T16:29:36.917+0000 I ASIO [ReplicationExecutor] dropping unhealthy pooled connection to server1.site:33000
2017-03-25T16:29:36.918+0000 I ASIO [ReplicationExecutor] after drop, pool was empty, going to spawn some connections
2017-03-25T16:29:44.935+0000 I REPL [ReplicationExecutor] Error in heartbeat request to server1.site:33000; HostUnreachable: Connection refused
....
2017-03-25T16:29:45.659+0000 I COMMAND [conn13] command admin.$cmd command: isMaster { ismaster: 1 } keyUpdates:0 writeConflicts:0 numYields:0 reslen:397 locks:{} protocol:op_query 521ms
...
SECONDARY:
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] This node is server1.site:33000 in the config
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] transition to STARTUP2
2017-03-25T16:29:28.219+0000 I REPL [ReplicationExecutor] Starting replication applier threads
2017-03-25T16:29:28.221+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to arbiter.site:33000
2017-03-25T16:29:28.224+0000 I ASIO [NetworkInterfaceASIO-Replication-0] Successfully connected to server2.site:33000
2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member arbiter.site:33000 is now in state ARBITER
2017-03-25T16:29:28.225+0000 I REPL [ReplicationExecutor] Member server2.site:33000 is now in state PRIMARY
2017-03-25T16:29:28.227+0000 I REPL [ReplicationExecutor] transition to SECONDARY
...
2017-03-25T16:29:34.223+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
2017-03-25T16:29:34.227+0000 I REPL [SyncSourceFeedback] setting syncSourceFeedback to server2.site:33000
2017-03-25T16:29:34.231+0000 I ASIO [NetworkInterfaceASIO-BGSync-0] Successfully connected to server2.site:33000
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: our last op time fetched: (term: 4, timestamp: Mar 25 16:26:13:21). source's GTE: (term: 5, timestamp: Mar 25 16:29:25:1) hashes: (-7718240682381005025/-2391617886558021771)
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] beginning rollback
2017-03-25T16:29:34.238+0000 I REPL [rsBackgroundSync] rollback 0
2017-03-25T16:29:34.238+0000 I REPL [ReplicationExecutor] transition to ROLLBACK
2017-03-25T16:29:34.239+0000 I NETWORK [conn4] end connection 10.196.36.203:20550 (5 connections now open)
2017-03-25T16:29:34.239+0000 I REPL [rsBackgroundSync] rollback 1
..
2017-03-25T16:29:34.241+0000 I REPL [rsBackgroundSync] rollback 2 FindCommonPoint
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback our last optime: Mar 25 16:26:13:21
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback their last optime: Mar 25 16:29:34:22
2017-03-25T16:29:34.242+0000 I REPL [rsBackgroundSync] rollback diff in end of log times: -201 seconds
2017-03-25T16:29:34.747+0000 I REPL [rsBackgroundSync] rollback 3 fixup
2017-03-25T16:29:34.787+0000 I REPL [rsBackgroundSync] rollback 3.5
2017-03-25T16:29:34.891+0000 I NETWORK [initandlisten] connection accepted from 10.196.36.195:4166 #12 (1 connection now open)
2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] rollback 4 n:1
2017-03-25T16:29:34.915+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:34:33)
2017-03-25T16:29:34.916+0000 I REPL [rsBackgroundSync] rollback 4.1.1 coll resync ConfigData
2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] build index on: ConfigData properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "ConfigData" }
2017-03-25T16:29:34.945+0000 I INDEX [rsBackgroundSync] building index using bulk method
2017-03-25T16:29:34.946+0000 I INDEX [rsBackgroundSync] build index done. scanned 409 total records. 0 sec
....
2017-03-25T16:29:36.155+0000 I INDEX [rsBackgroundSync] build index done. scanned 769 total records. 0 secs
2017-03-25T16:29:36.155+0000 I REPL [rsBackgroundSync] rollback 4.2
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] minvalid=(term: 5, timestamp: Mar 25 16:29:36:1)
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.3
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback 4.6
2017-03-25T16:29:36.156+0000 I REPL [rsBackgroundSync] rollback drop: Data1
2017-03-25T16:29:36.157+0000 I REPL [rsBackgroundSync] rollback drop: Data2
2017-03-25T16:29:36.165+0000 I REPL [rsBackgroundSync] rollback drop: Data3
2017-03-25T16:29:36.179+0000 I REPL [rsBackgroundSync] rollback 4.7
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 5 d:4 u:71
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback 6
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback done
2017-03-25T16:29:36.197+0000 I REPL [rsBackgroundSync] rollback finished
2017-03-25T16:29:36.197+0000 I NETWORK [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 10.196.36.195:33000
2017-03-25T16:29:36.207+0000 I REPL [ReplicationExecutor] transition to RECOVERING
2017-03-25T16:29:36.208+0000 I REPL [ReplicationExecutor] syncing from: server2.site:33000
2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 10.196.36.195:33000
2017-03-25T16:29:36.356+0000 I REPL [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 10.196.36.195:33000, will retry
2017-03-25T16:29:36.365+0000 F REPL [repl writer worker 2] writer worker caught exception: :: caused by :: 26 Failed to apply insert due to missing collection: { ts: Timestamp 1490459371000|82, t: 5, h: 7141407996441188723, v: 2, op: "i", <collection data here> }
2017-03-25T16:29:36.365+0000 I - [repl writer worker 2] Fatal Assertion 16360
2017-03-25T16:29:36.365+0000 I - [repl writer worker 2]



 Comments   
Comment by Kelsey Schubert [ 21/Jun/18 ]

Hi wallisr,

Have you observed this issue since upgrading to a more recent version of MongoDB?

Thanks,
Kelsey

Comment by Kelsey Schubert [ 06/Mar/18 ]

Thanks for the update, please let us know whether 3.2.14 resolves the issue.

Comment by Andy Wallis [ 06/Mar/18 ]

It is still an issue. We are testing whether 3.2.14 fixes it.

Comment by Kelsey Schubert [ 06/Mar/18 ]

Hi wallisr,

Sorry this slipped through the cracks. Is this still an issue for you? While we cannot say for sure that this issue has been fixed in more recent versions, there have been a number of improvements and bug fixes in this space that may resolve what you're seeing. Have you been able to upgrade to 3.4 or a later version of 3.2?

Kind regards,
Kelsey

Comment by Andy Wallis [ 02/Nov/17 ]

We still have the logs, but they cannot be posted in an open forum and can only be shared under an NDA.
We can re-create the issue as required.
We cannot upgrade to 3.4 without significant effort. Has this exact issue been fixed in a later 3.2 release?

Comment by Ramon Fernandez Marina [ 02/Nov/17 ]

Sorry to hear you ran into issues, wallisr. Without the full logs it is difficult to determine exactly what happened. If a secondary is missing a collection, the fatal assertion is expected (SERVER-23390).
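
In case it helps narrow this down, here is a quick pymongo sketch for spotting which namespace the failed insert expects but the secondary no longer has (the database name "appdb" is a placeholder, since the namespace is redacted in the oplog entry above): diff the collection lists on both members right after the secondary restarts and before it fasserts again.

    from pymongo import MongoClient

    primary = MongoClient("server2.site", 33000)
    secondary = MongoClient("server1.site", 33000,
                            readPreference="secondaryPreferred")

    db = "appdb"
    missing = (set(primary[db].list_collection_names())
               - set(secondary[db].list_collection_names()))
    print("collections on PRIMARY but not on SECONDARY:", sorted(missing))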

The behavior you describe is very similar to SERVER-26741, even though the scenario there is different from yours. In addition, there have been many improvements and bug fixes in rollback behavior since 3.2.3, so my recommendation would be to upgrade to 3.4.10 to benefit from those.

If you've rebuilt the replica set, then I'm guessing you no longer have the logs and the oplog collection for us to investigate further, is that correct?

Thanks,
Ramón.
