Major - P3
I have been running some replication tests, and found something quite puzzling regarding failovers.
Looking at packet capture off the bridge interface (attached) and log files (snippets attached): After failure on network connection of Node 1 (which was master, last successful communications around packet 2414), Node 2 became master (2566 seems to be the first heartbeat with Node 2 as master). The client eventually timed out as expected. When the client attempted to perform the next update on Node 2 (packets 3271 and 3273), the write timed out. It appears that Node 2 thinks it is master, but instead of Node 3 replicating from Node 2, Node 2 is instead asking for replication updated from Node 3 (e.g. 3288). I think this is a bug.
The replication setup is a standard 3-node replica set. The nodes run in separate VirtualBox VMs with bridged networking. The network interfaces are connected to TAP interfaces on the host and connected together using a bridge device. In the test in question, a single client running on the host updates numbers in collection 'test' on the cluster one at a time. A failure is introduced by killing network connection of one of the VMs by setting packet loss on its TAP interface to 100% with netem.
Snippet from mongodb.log of Node 2 when it became master:
Wed Jul 4 16:44:37 [rsHealthPoll] DBClientCursor::init call() failed
Wed Jul 4 16:44:37 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): DBClientBase::findN: transport error: test1:27017 query:
Wed Jul 4 16:44:37 [rsHealthPoll] replSet member test1:27017 is now in state DOWN
Wed Jul 4 16:44:37 [rsMgr] replSet info electSelf 2
Wed Jul 4 16:44:37 [rsMgr] replSet PRIMARY
Snippet from mongodb.log of Node 3 when it formed new cluster with Node 2:
Wed Jul 4 16:44:37 [conn190] replSet info voting yea for test2:27017 (2)
Wed Jul 4 16:44:37 [rsHealthPoll] couldn't connect to test1:27017: couldn't connect to server test1:27017
Wed Jul 4 16:44:38 [rsHealthPoll] replSet member test2:27017 is now in state PRIMARY
Wed Jul 4 16:44:38 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:44 [rsMgr] replSet info two primaries (transiently)
Wed Jul 4 16:44:47 [rsHealthPoll] replSet info test1:27017 is down (or slow to respond): socket exception
Wed Jul 4 16:44:47 [rsHealthPoll] replSet member test1:27017 is now in state DOWN