[SERVER-18964] Primary in a replset lost primary status during YCSB 50/50 load Created: 13/Jun/15  Updated: 14/Apr/16  Resolved: 05/Aug/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.0.5, 3.1.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Chung-yen Chang Assignee: Benety Goh
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File cluster-config_new_logpath_3.1.4-mongod_logs.tar.gz     File data_and_log_same_partition_3.0.5-mongod_logs.tar.gz     File log_appender_flushing_disabled_3.1.4-mongod_logs.tar.gz     File mongod.log.10.3.1.100     File mongod.log.10.3.1.101     File mongod.log.10.3.1.102    
Backwards Compatibility: Fully Compatible
Operating System: Linux
Steps To Reproduce:

Workload: YCSB workload A (50% updates, 50% reads), 20 million documents; the workload is set to run 200 million operations but usually hits this condition well before that (around 10 million ops). A hedged reproduction sketch follows the configuration below.
Configuration:
3-node replset, each node is an Amazon m3.2xlarge (8 vCPU, 30 GB memory)
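
For reference, a reproduction sketch using the stock YCSB workload A with record and operation counts matching the description above. The connection string, thread count, and database name are assumptions for illustration, not taken from the original test setup, and property names may vary slightly between YCSB versions.

# Load phase: insert the 20 million documents.
./bin/ycsb load mongodb -s -P workloads/workloada \
    -p recordcount=20000000 \
    -p mongodb.url="mongodb://10.3.1.100:27017/ycsb?replicaSet=rs0" \
    -threads 32

# Run phase: 50% reads / 50% updates, nominally 200 million operations.
./bin/ycsb run mongodb -s -P workloads/workloada \
    -p recordcount=20000000 \
    -p operationcount=200000000 \
    -p readproportion=0.5 \
    -p updateproportion=0.5 \
    -p mongodb.url="mongodb://10.3.1.100:27017/ycsb?replicaSet=rs0" \
    -threads 32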

Sprint: RPL 7 08/10/15
Participants:

 Description   

While running a YCSB 50/50 workload on EC2 instances, the connections between the primary and the secondaries timed out, and this led to one of the secondaries taking over as the primary.



 Comments   
Comment by Daniel Pasette (Inactive) [ 03/Aug/15 ]

The cause of the flapping in this particular configuration is that heartbeat commands are blocked by slow IO on the server. The mongod nodes have the logpath and dbpath on the same volume, which, under the conditions of the stress test, leaves topology coordinator threads and the heartbeat thread blocked on IO. Because the replica set is configured with uneven priorities (one node at priority=1, two nodes at priority=0.5), and because the primary doesn't actually crash but merely slows down, the replica set flaps. There are two ways to alleviate the issue in this situation (both are sketched after the list):

  • move the logpath to a different partition than the dbpath
  • adjust the --slowms setting to a higher threshold (logging more to the already burdened volume just exacerbates the issue)
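
A minimal sketch of a mongod invocation that applies both adjustments. The mount points are hypothetical, and the slowms value is only an example of "a higher threshold", not a recommendation from this ticket.

# Log file on a separate partition from the dbpath, and a raised slow-op
# threshold so less is written to the already-burdened data volume under load.
mongod --replSet rs0 \
       --storageEngine wiredTiger \
       --dbpath /data/db \
       --logpath /mnt/logs/mongod.log \
       --logappend \
       --slowms 1000 \
       --port 27017 --fork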

This has been observed in 3.0.x as well as 3.1.x, but only with WiredTiger. WiredTiger has had a number of issues related to disk IO, especially during checkpoints and under workloads like YCSB 50/50, which cause a large proportion of the documents to be written between checkpoints.

Comment by Githook User [ 03/Aug/15 ]

Author: Benety Goh (benety) <benety@mongodb.com>

Message: SERVER-18964 cleaned up util/stacktrace.h
Branch: master
https://github.com/mongodb/mongo/commit/0cc69a3004d94834890fb96dca4d3835669aadf2

Comment by Chung-yen Chang [ 09/Jul/15 ]

I reran the same test against a recent mongod build (git version: 87201c5db22901f103dc5e1de1dc3a67aab36257). I am doing a very long run that has now been going for more than 20 hours, and this has only happened once. I used to see multiple occurrences of this issue in tests that were 2-3 hours long, so this is definitely an improvement. The various improvements made to WiredTiger must have helped here.

Comment by Andy Schwerin [ 15/Jun/15 ]

It appears that heartbeats are being delayed during WiredTiger stalls. It would be good to chase down the link between storage stalls and heartbeats to see if it can be eliminated, even though we also want to get rid of the stalls in general.

Comment by Chung-yen Chang [ 15/Jun/15 ]

I was able to reproduce the same issue with just a straight replset (without sharding) and collected the mongod logs from all three nodes. However, the log from the primary node is too large to upload to Jira, even after I deleted all the lines after the first occurrence of the problem. The sequence of events in the log around the takeover looks very similar: the secondaries showed very regular connection end/established activity until a socket error, followed by an election. The primary noticed that another primary had been elected and gave up.

From the secondary (10.3.1.102) that took over.

2015-06-13T11:53:16.256+0000 I NETWORK  [conn179] end connection 10.3.1.101:37671 (3 connections now open)
2015-06-13T11:53:16.256+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.101:37673 #181 (4 connections now open)
2015-06-13T11:53:16.287+0000 I NETWORK  [conn180] end connection 10.3.1.100:60780 (3 connections now open)
2015-06-13T11:53:16.288+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.100:60782 #182 (4 connections now open)
2015-06-13T11:53:30.765+0000 I NETWORK  [ReplExecNetThread-6] Socket recv() timeout  10.3.1.100:27017
2015-06-13T11:53:30.766+0000 I NETWORK  [ReplExecNetThread-6] SocketException: remote: 10.3.1.100:27017 error: 9001 socket exception [RECV_TIMEOUT] server [10.3.1.100:27017]
2015-06-13T11:53:30.791+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to ip-10-3-1-100:27017; HostUnreachable network error while attempting to run command 'replSetHeartbeat' on host 'ip-10-3-1-100:27017' with arguments '{ replSetHeartbeat: "rs0", pv: 1, v: 5, from: "10.3.1.102:27017", fromId: 2, checkEmpty: false }' and metadata '{}'
2015-06-13T11:53:30.792+0000 I REPL     [ReplicationExecutor] Standing for election
2015-06-13T11:53:30.797+0000 I REPL     [ReplicationExecutor] possible election tie; sleeping 445ms until 2015-06-13T11:53:31.242+0000
2015-06-13T11:53:31.242+0000 I REPL     [ReplicationExecutor] Standing for election
2015-06-13T11:53:31.242+0000 I REPL     [ReplicationExecutor] running for election
2015-06-13T11:53:31.245+0000 I REPL     [ReplicationExecutor] received vote: 1 votes from 10.3.1.101:27017
2015-06-13T11:53:31.245+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role
2015-06-13T11:53:31.245+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
2015-06-13T11:53:32.308+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2015-06-13T11:53:32.792+0000 I REPL     [ReplicationExecutor] Member ip-10-3-1-100:27017 is now in state SECONDARY

On the original primary (10.3.1.100)

2015-06-13T11:53:20.765+0000 I NETWORK  [conn324] end connection 10.3.1.102:46997 (72 connections now open)
2015-06-13T11:53:32.293+0000 I REPL     [ReplicationExecutor] Member 10.3.1.102:27017 is now in state PRIMARY

One way to get the log file down to a manageable size is to skip all the lines about slow responses (most of them from being unable to contact the primary) and keep just the lines about NETWORK and REPL. If that is useful, let me know and I will prune the log from the primary and upload it here.
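
For what it's worth, a filter along these lines would do that pruning, assuming the primary's log follows the same format as the excerpts above (the filenames are just placeholders):

# Keep only NETWORK and REPL lines; drop the slow-operation noise.
grep -E ' [IWEF] (NETWORK|REPL) +\[' mongod.log.10.3.1.100 > mongod.log.10.3.1.100.pruned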

Comment by Chung-yen Chang [ 13/Jun/15 ]

Three mongod log files from a cluster run are attached. These are the mongod logs from one of the two shards in the cluster. The file ending with 10.3.1.100 is the log from the original primary node. Be warned that the log from the primary is rather large. I will gather and upload another set of files from a straight replset.

In the file ending with 101, there was a clear pattern of connections being established and ended between the nodes every 30 seconds, until it hit a socket timeout, which was followed by an election.

2015-06-13T01:38:53.303+0000 I NETWORK  [conn145] end connection 10.3.1.102:33961 (5 connections now open)
2015-06-13T01:38:53.303+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.102:33963 #147 (7 connections now open)
2015-06-13T01:38:53.329+0000 I NETWORK  [conn146] end connection 10.3.1.100:58583 (5 connections now open)
2015-06-13T01:38:53.598+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.100:58585 #148 (7 connections now open)
2015-06-13T01:39:23.609+0000 I NETWORK  [conn148] end connection 10.3.1.100:58585 (5 connections now open)
2015-06-13T01:39:23.609+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.100:58587 #149 (6 connections now open)
2015-06-13T01:39:23.610+0000 I NETWORK  [conn147] end connection 10.3.1.102:33963 (5 connections now open)
2015-06-13T01:39:23.610+0000 I NETWORK  [initandlisten] connection accepted from 10.3.1.102:33965 #150 (6 connections now open)
2015-06-13T01:39:32.646+0000 I NETWORK  [ReplExecNetThread-4] Socket recv() timeout  10.3.1.100:27017
2015-06-13T01:39:32.646+0000 I NETWORK  [ReplExecNetThread-4] SocketException: remote: 10.3.1.100:27017 error: 9001 socket exception [RECV_TIMEOUT] server [10.3.1.100:27017]
2015-06-13T01:39:32.653+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to ip-10-3-1-100:27017; HostUnreachable network error while attempting to run command 'replSetHeartbeat' on host 'ip-10-3-1-100:27017' with arguments '{ replSetHeartbeat: "rs0", pv: 1, v: 5, from: "10.3.1.101:27017", fromId: 1, checkEmpty: false }' and metadata '{}'
2015-06-13T01:39:32.653+0000 I REPL     [ReplicationExecutor] Standing for election
2015-06-13T01:39:32.656+0000 I REPL     [ReplicationExecutor] running for election
2015-06-13T01:39:32.657+0000 I REPL     [ReplicationExecutor] received vote: 1 votes from 10.3.1.102:27017
2015-06-13T01:39:32.657+0000 I REPL     [ReplicationExecutor] election succeeded, assuming primary role
2015-06-13T01:39:32.657+0000 I REPL     [ReplicationExecutor] transition to PRIMARY
2015-06-13T01:39:37.592+0000 I REPL     [rsSync] transition to primary complete; database writes are now permitted
2015-06-13T01:39:38.261+0000 I REPL     [ReplicationExecutor] Member ip-10-3-1-100:27017 is now in state PRIMARY
2015-06-13T01:39:38.261+0000 I REPL     [ReplicationExecutor] another primary seen with election time 557b81b1:2 my election time is 557b89d4:1
2015-06-13T01:39:38.261+0000 I REPL     [ReplicationExecutor] another PRIMARY detected and it should step down since it was elected earlier than me
2015-06-13T01:39:38.261+0000 I REPL     [ReplicationExecutor] Requesting ip-10-3-1-100:27017 step down from primary

The file ending with 102 showed a similar pattern, except that that node did not get elected as primary.

The primary (100) saw another primary a few seconds later, before logging any socket error of its own.

2015-06-13T01:39:38.253+0000 I REPL     [ReplicationExecutor] another primary seen with election time 557b89d4:1 my election time is 557b81b1:2
.....
2015-06-13T01:39:38.479+0000 I REPL     [replExecDBWorker-2] transition to SECONDARY
.....
2015-06-13T01:39:38.493+0000 I NETWORK  [conn80] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.3.1.99:60955]

The server status from these nodes was also collected, but nothing jumped out at me. One observation that might be relevant is that the primary node's CPU usage fluctuated a lot and was very high at times. I did not collect that in a file, but I was monitoring the system during the run and it often reached 700+% on an 8 vCPU instance. The network seemed to be fine, because a couple of other ssh sessions were connected to this system for data collection and the data seemed to have been collected fine. Somehow the secondaries just weren't able to reach the primary and decided to have a coup. I will gather another set of logs for the 3.1.4 replset and attach them too.
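
One way such stats could be collected periodically is sketched below; the hosts are the nodes above, while the interval, output paths, and the use of top for the CPU snapshot are assumptions for illustration, not the actual collection method used in this run.

# Sample serverStatus from each node and a local CPU snapshot every 10 seconds.
while true; do
    for host in 10.3.1.100 10.3.1.101 10.3.1.102; do
        mongo --host "$host" --quiet --eval 'printjson(db.serverStatus())' \
            >> "serverStatus.$host.json"
    done
    top -b -n 1 | head -n 20 >> cpu.local.log
    sleep 10
done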
