[SERVER-23003] Recovery problems after network partition. Created: 08/Mar/16  Updated: 28/Aug/18  Resolved: 11/Mar/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.3
Fix Version/s: 3.2.5, 3.3.3

Type: Bug Priority: Major - P3
Reporter: Róbert Hegedűs Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: code-only
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File dc1-node1.log     File dc1-node2.log     File dc2-node1.log     File dc2-node2.log     File dc3-node1.log     File rs.conf    
Issue Links:
Related
is related to SERVER-22130 Reset applier lastAppliedOptime after... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Completed:
Steps To Reproduce:

Setup:

  • Replica set with 5 nodes (4 data nodes and 1 arbiter) across 3 datacenters.
  • DC1: primary, secondary
  • DC2: secondary, secondary
  • DC3: arbiter

Disable the network between DC1 and DC2.
The primary switches back and forth between DC1 and DC2.
Re-enable the network between DC1 and DC2.
Some of the nodes crash with the error message shown in the Description below.
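For reference, a minimal mongo shell sketch of the topology described above. The host names (taken from the attachment names), the port, and the set name "rs0" are illustrative assumptions, not the actual configuration; see the attached rs.conf for that.

// Illustrative 5-member set: two data nodes each in DC1 and DC2, arbiter in DC3.
// Host names, port, and the set name "rs0" are placeholders.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "dc1-node1:27017" },
    { _id: 1, host: "dc1-node2:27017" },
    { _id: 2, host: "dc2-node1:27017" },
    { _id: 3, host: "dc2-node2:27017" },
    { _id: 4, host: "dc3-node1:27017", arbiterOnly: true }
  ]
})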

Sprint: Repl 11 (03/11/16), Repl 12 (04/01/16)
Participants:
Case:

 Description   

2016-03-07T09:09:29.127-0500 I REPL     [rsBackgroundSync] rollback 6
2016-03-07T09:09:29.127-0500 I COMMAND  [rsBackgroundSync] query admin.system.roles planSummary: EOF ntoreturn:0 ntoskip:0 keysExamined:0 docsExamined:0 cursorExhausted:1 keyUpdates:0 writeConflicts:0 numYields:0 nreturned:0 reslen:20 locks:{ Global: { acquireCount: { r: 11, w: 2, W: 1 } }, MMAPV1Journal: { acquireCount: { r: 3, w: 5 } }, Database: { acquireCount: { r: 4, w: 1, W: 1 } }, Collection: { acquireCount: { R: 3 } }, oplog: { acquireCount: { R: 1, W: 1 } } } 0ms
2016-03-07T09:09:29.128-0500 I REPL     [rsBackgroundSync] rollback done
2016-03-07T09:09:29.128-0500 I NETWORK  [SyncSourceFeedback] Socket say send() errno:9 Bad file descriptor 192.168.200.4:27017
2016-03-07T09:09:29.128-0500 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update: socket exception [SEND_ERROR] for 192.168.200.4:27017
2016-03-07T09:09:29.128-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
2016-03-07T09:09:29.128-0500 I REPL     [rsBackgroundSync] rollback finished
2016-03-07T09:09:29.128-0500 I REPL     [SyncSourceFeedback] updateUpstream failed: Location9001: socket exception [SEND_ERROR] for 192.168.200.4:27017, will retry
2016-03-07T09:09:29.128-0500 I REPL     [ReplicationExecutor] syncing from: 192.168.200.2:27017
2016-03-07T09:09:29.129-0500 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
2016-03-07T09:09:29.131-0500 I REPL     [SyncSourceFeedback] setting syncSourceFeedback to 192.168.200.2:27017
2016-03-07T09:09:29.132-0500 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
2016-03-07T09:09:29.133-0500 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to 192.168.200.2:27017
2016-03-07T09:09:29.135-0500 I -        [rsSync] Fatal assertion 34361 OplogOutOfOrder: Attempted to apply an earlier oplog entry (ts: Mar  7 09:09:00:2) when our lastWrittenOptime was (term: 3, timestamp: Mar  7 09:09:11:2)
2016-03-07T09:09:29.143-0500 I CONTROL  [rsSync] 
 0x12f14b2 0x128f788 0x127c722 0xf2a364 0xf1ff80 0x7fd3fcdcca40 0x7fd3fc5e9182 0x7fd3fc31647d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EF14B2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"E8F788","s":"_ZN5mongo10logContextEPKc"},{"b":"400000","o":"E7C722","s":"_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE"},{"b":"400000","o":"B2A364","s":"_ZN5mongo4repl8SyncTail16oplogApplicationEv"},{"b":"400000","o":"B1FF80","s":"_ZN5mongo4repl13runSyncThreadEv"},{"b":"7FD3FCD1B000","o":"B1A40"},{"b":"7FD3FC5E1000","o":"8182"},{"b":"7FD3FC21C000","o":"FA47D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.3", "gitVersion" : "b326ba837cf6f49d65c2f85e1b70f6f31ece7937", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.19.0-25-generic", "version" : "#26~14.04.1-Ubuntu SMP Fri Jul 24 21:16:20 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "C1CD0F405485844DA016C6B5275C8BEF3D68DB7A" }, { "b" : "7FFE8D1C5000", "elfType" : 3, "buildId" : "01005E92EA3C0C1527396B65C7412468298D0813" }, { "b" : "7FD3FD806000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "D08DD65F97859C71BB2CBBF1043BD968EFE18AAD" }, { "b" : "7FD3FD42B000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "F86FA9FB4ECEB4E06B40DBDF761A4172B70A4229" }, { "b" : "7FD3FD223000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }, { "b" : "7FD3FD01F000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }, { "b" : "7FD3FCD1B000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "4BF6F7ADD8244AD86008E6BF40D90F8873892197" }, { "b" : "7FD3FCA15000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }, { "b" : "7FD3FC7FF000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }, { "b" : "7FD3FC5E1000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }, { "b" : "7FD3FC21C000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }, { "b" : "7FD3FDA65000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12f14b2]
 mongod(_ZN5mongo10logContextEPKc+0x138) [0x128f788]
 mongod(_ZN5mongo23fassertFailedWithStatusEiRKNS_6StatusE+0x62) [0x127c722]
 mongod(_ZN5mongo4repl8SyncTail16oplogApplicationEv+0x1564) [0xf2a364]
 mongod(_ZN5mongo4repl13runSyncThreadEv+0x230) [0xf1ff80]
 libstdc++.so.6(+0xB1A40) [0x7fd3fcdcca40]
 libpthread.so.0(+0x8182) [0x7fd3fc5e9182]
 libc.so.6(clone+0x6D) [0x7fd3fc31647d]
-----  END BACKTRACE  -----
2016-03-07T09:09:29.143-0500 I -        [rsSync] 
 
***aborting after fassert() failure
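For context, fatal assertion 34361 fires when the oplog applier is handed an entry whose timestamp is older than the node's last written optime. As an illustration only, the newest oplog entry (the optime the assertion compares against) can be inspected from the mongo shell like this:

// Print the term and timestamp of the newest entry in the local oplog.
// The "t" (term) field is present when replication protocolVersion 1 is in use.
var last = db.getSiblingDB("local").getCollection("oplog.rs")
             .find().sort({ $natural: -1 }).limit(1).next();
printjson({ term: last.t, ts: last.ts });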



 Comments   
Comment by Eric Milkie [ 11/Mar/16 ]

We found and fixed the source of the issue. The fix will be available in version 3.2.5, which is scheduled for release in the near future. Thank you for reporting this!

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'name': u'Scott Hernandez', u'email': u'scotthernandez@tart.local'}

Message: SERVER-23003: Wait for synctail to clear the buffer.

(cherry picked from commit 7a8fae34f9949fadc515c47d956bbbc464842664)
Branch: v3.2
https://github.com/mongodb/mongo/commit/4834c6b6965281b5f861e8df398ad285e3352e16

Comment by Githook User [ 11/Mar/16 ]

Author:

{u'name': u'Scott Hernandez', u'email': u'scotthernandez@tart.local'}

Message: SERVER-23003: Wait for synctail to clear the buffer.
Branch: master
https://github.com/mongodb/mongo/commit/7a8fae34f9949fadc515c47d956bbbc464842664

Comment by Róbert Hegedűs [ 10/Mar/16 ]

Good, thanks!

Comment by Eric Milkie [ 10/Mar/16 ]

I've successfully reproduced the issue here, and am working on a fix.

Comment by Eric Milkie [ 09/Mar/16 ]

Ah, yes, that should be enough for me in the logs you already gave me. Sorry I didn't see that before!
-Eric

Comment by Róbert Hegedűs [ 09/Mar/16 ]

Hi Eric,
If you look at the end of dc2-node1.log and dc2-node2.log, you can see that the same problem occurred again; maybe these logs are enough for you to recreate it? Or should I also attach the logs from earlier, when the problem first occurred?
Róbert

Comment by Eric Milkie [ 09/Mar/16 ]

Hi Róbert,
I looked at the logs and it seems that all nodes were restarted today after midnight, which is where the logs begin. Do you happen to have the logs from the servers for the period when the problem occurred, that is, March 7 around 09:09 and earlier? With those logs I can reconstruct what went wrong. Thanks!
-Eric

Comment by Róbert Hegedűs [ 09/Mar/16 ]

Hi Eric!
Here are the full logs from every node in the replica set and the "rs.conf()" output.
Kind regards,
Róbert

Comment by Eric Milkie [ 08/Mar/16 ]

Hi Róbert,
I'm sorry you experienced this problem. If possible, could you attach the full logs from every node in the replica set?
Also, please attach the output of "rs.conf()" when run in the MongoDB shell connected to one of the nodes.
Thanks,
Eric
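
For anyone following along, the requested configuration can be collected with a one-liner like the following; this is an illustration only, run in a mongo shell connected to any member of the set.

// Prints the current replica set configuration as JSON.
printjson(rs.conf())

The same output can also be captured non-interactively, e.g. mongo --quiet --host <node> --eval "printjson(rs.conf())" redirected to a file such as the attached rs.conf (the host name is a placeholder).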
