[SERVER-18964] Primary in a replset lost primary status during YCSB 50/50 load Created: 13/Jun/15 Updated: 14/Apr/16 Resolved: 05/Aug/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.0.5, 3.1.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Chung-yen Chang | Assignee: | Benety Goh |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | Linux |
| Steps To Reproduce: | Workload: YCSB workload A (50% update, 50% read) against 20 million documents. The workload is set to run 200 million operations but usually hits this condition well before that (~10 million ops, maybe). |
| Sprint: | RPL 7 08/10/15 |
| Participants: |
| Description |
|
While running a YCSB 50/50 workload on EC2 instances, the connections between the primary and the secondaries timed out, and this led to one of the secondaries taking over as the primary. |
| Comments |
| Comment by Daniel Pasette (Inactive) [ 03/Aug/15 ] |
|
The cause of the flapping in this particular configuration is that heartbeat commands are blocked by slow IO on the server. The configuration for the mongod nodes has the logpath and dbpath on the same volume, which, under the conditions of the stress test, causes threads in the topology coordinator and the heartbeat thread to block on IO. Because the replica set is configured with uneven priorities (one node at priority=1 and two nodes at priority=0.5), and because the primary doesn't actually crash but only slows down, the replica set flaps. There are two ways to alleviate the issue in this situation:
This has been observed in 3.0.x as well as 3.1.x, but only with WiredTiger. WiredTiger has had a number of issues related to disk IO, especially during checkpoints, and especially during workloads like YCSB 50/50 that cause a large proportion of the documents to be written between checkpoints. |
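For illustration only (not part of the original comment): a minimal pymongo sketch of one mitigation that the diagnosis above seems to imply, evening out the member priorities so that a slow-but-alive primary is not repeatedly voted out. The hostname, port, and chosen priority value are assumptions; moving logpath and dbpath onto separate volumes is a deployment change and is not shown.

```python
# Hypothetical sketch: even out the replica set priorities described above
# (one member at priority=1, two at priority=0.5). Hostname/port are assumptions.
from pymongo import MongoClient

client = MongoClient("10.3.1.100", 27017)  # assumed address of the current primary

# Fetch the current replica set configuration document.
config = client.admin.command("replSetGetConfig")["config"]

# Give every member the same priority so a slow (but not crashed) primary
# is less likely to be displaced by heartbeat delays alone.
for member in config["members"]:
    member["priority"] = 1

config["version"] += 1  # replSetReconfig requires a bumped config version
client.admin.command("replSetReconfig", config)
```

This would need to be run against the current primary; it only softens the flapping symptom, not the underlying IO contention.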
| Comment by Githook User [ 03/Aug/15 ] |
|
Author: {u'username': u'benety', u'name': u'Benety Goh', u'email': u'benety@mongodb.com'}
Message: |
| Comment by Chung-yen Chang [ 09/Jul/15 ] |
|
I reran the same test against a recent mongod build (git version: 87201c5db22901f103dc5e1de1dc3a67aab36257). I am doing a very long run, which has now been running for more than 20 hours, and this has only happened once. I used to see multiple occurrences of this issue in tests that were 2-3 hours long, so this is definitely an improvement. The various improvements made to WiredTiger must have helped here. |
| Comment by Andy Schwerin [ 15/Jun/15 ] |
|
It appears that heartbeats are being delayed during WiredTiger stalls. It would be good to chase down the link between storage stalls and heartbeats, to see if it can be eliminated, even if we also want to get rid of the stalls in general. |
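As a rough illustration of how one might chase that link (an assumption, not an existing tool from this ticket): poll replSetGetStatus and flag unusually large gaps since the last heartbeat received from each member, then line those gaps up against WiredTiger checkpoint activity in the logs. The hostname and threshold below are assumptions.

```python
# Hypothetical heartbeat-gap watcher: report members whose last received
# heartbeat is older than expected (the default heartbeat interval is 2 seconds).
import time
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("10.3.1.100", 27017, tz_aware=True)  # assumed node address

while True:
    status = client.admin.command("replSetGetStatus")
    now = datetime.now(timezone.utc)
    for member in status["members"]:
        recv = member.get("lastHeartbeatRecv")  # absent for the queried node itself
        if recv is None:
            continue
        gap = (now - recv).total_seconds()
        if gap > 5:  # arbitrary threshold, well above the 2s heartbeat interval
            print(f"{member['name']}: {gap:.1f}s since last heartbeat received")
    time.sleep(1)
```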
| Comment by Chung-yen Chang [ 15/Jun/15 ] |
|
I was able to reproduce the same issue with just a straight replset (without sharding) and collected the mongod logs from all three nodes. However, the log from the primary node is too large to upload to Jira, even after I deleted all the lines after the first occurrence of the problem. The sequence of events in the log around the takeover looks very similar. The secondaries were showing very regular connection ends/establishments until a socket error, followed by an election. The primary noticed that another primary had been elected and gave up. From the secondary (10.3.1.102) that took over:
On the original primary (10.3.1.100):
One way to get the log file size down to something manageable is to skip all the lines about slow responses (most of them from not being able to contact the primary) and keep just the lines about network and repl. If that would be useful, let me know and I will prune the log from the primary and upload it here. |
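A rough sketch of the pruning described above (an assumption, not an attached script): drop everything except the NETWORK and REPL component lines from a mongod 3.x log, which discards the slow-response noise while keeping the connection and election activity.

```python
# Hypothetical log filter: keep only NETWORK and REPL component lines
# from a mongod log file; everything else (including slow-op lines) is dropped.
import sys

KEEP_COMPONENTS = ("NETWORK", "REPL")

def keep(line):
    # mongod 3.x log lines look like: "<timestamp> <severity> <component> [ctx] message"
    parts = line.split(None, 3)
    return len(parts) >= 3 and parts[2] in KEEP_COMPONENTS

with open(sys.argv[1]) as src, open(sys.argv[2], "w") as dst:
    for line in src:
        if keep(line):
            dst.write(line)
```

Usage (file names hypothetical): `python prune_log.py mongod-100.log mongod-100-pruned.log`.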
| Comment by Chung-yen Chang [ 13/Jun/15 ] |
|
Three mongod files from a cluster run are attached. These are the mongod logs from one of the two shards in the cluster. The file ending with 10.3.1.100 is the log from the original primary node. Be warned that the log from the primary is rather large. I will gather and upload another set of files from a straight replset. In the file ending with 101, there was a clear pattern of connections being established and ended between the nodes every 30 seconds, until it hit a socket timeout, which was followed by an election.
File *102 showed a similar pattern, except that this node did not get elected as primary. The primary (100) saw another primary a few seconds later, but before showing any socket error itself.
The server status from these nodes was also collected, but nothing jumped out at me. One observation that might be relevant is that the primary node's CPU usage fluctuated a lot and was very high at times. I did not have that collected in a file, but I was monitoring the system during the run and it often reached 700+% on an 8 vCPU instance. The network seems to be fine, because there were a couple of other ssh sessions connecting to this system for data collection and the data seemed to have been collected fine. Somehow the secondaries just weren't able to reach the primary and decided to have a coup. I will gather another set of logs for the 3.1.4 replset and attach them too. |
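For what it's worth, a small polling sketch (an assumption, not one of the collectors actually used here) that records each node's reported role and connection count over time, so a takeover can be lined up against the mongod logs afterwards. The addresses are the IPs mentioned in this ticket; the polling interval is arbitrary.

```python
# Hypothetical role/connection poller for the three replica set members.
import time
from datetime import datetime
from pymongo import MongoClient

NODES = ["10.3.1.100", "10.3.1.101", "10.3.1.102"]  # addresses from this ticket
clients = {h: MongoClient(h, 27017, directConnection=True, serverSelectionTimeoutMS=2000)
           for h in NODES}

while True:
    stamp = datetime.utcnow().isoformat()
    for host, client in clients.items():
        try:
            hello = client.admin.command("isMaster")  # "hello" on newer servers
            conns = client.admin.command("serverStatus")["connections"]["current"]
            role = "PRIMARY" if hello.get("ismaster") else "SECONDARY/OTHER"
            print(f"{stamp} {host} {role} connections={conns}")
        except Exception as exc:  # e.g. socket timeouts while the node is stalled
            print(f"{stamp} {host} unreachable: {exc}")
    time.sleep(5)
```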