[SERVER-17649] Secondary oplog replication inserts stall/slow under heavy insert load using WT Created: 18/Mar/15 Updated: 04/Jun/15 Resolved: 17/Apr/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, WiredTiger |
| Affects Version/s: | 3.0.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Oleg Rekutin | Assignee: | Alexander Gorrod |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
We noticed that one of our 4 shards had a secondary with high replication lag. The replica shard's log has periodic very slow serverStatus entries with large values for "after oplog": 2015-03-18T18:30:44.539+0000 I COMMAND [conn82] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after oplog: 72189, after repl: 72189, after storageEngine: 72189, after wiredTiger: 72189, at end: 72189 } We restarted the replica shard hoping that this was a fluke, but the shard quickly went back to a low insertion rate. Running db.getSiblingDB("local").oplog.rs.stats() produces:
|
| Comments |
| Comment by Alexander Gorrod [ 10/Apr/15 ] | ||||||||||
|
Thanks for the update, and for being so helpful in providing information - I'll chase it some more based on the information you already provided, but will close it if I can't find anything obvious. We've got a couple of other cases that could be related that either have fixes or reproducible workloads. | ||||||||||
| Comment by Oleg Rekutin [ 09/Apr/15 ] | ||||||||||
|
Alexander Gorrod, this has stopped being an issue for us at the moment, as our need to run the heavy insert loads has abated (it was a one-time need; the cluster experiences moderate load normally). Unfortunately, I have not had a chance to replicate this scenario in a test environment where I could send similar load, and absent the original load pressure, I am not seeing this issue in day-to-day operation. I apologize that I cannot get you the extra system information at this time. If you have ideas for other things I can check for when this happens again, let me know. I hate to leave issues hanging like this--if it helps, we can close right now, and I can reopen once there is an opportunity for me to instrument/collect stats for this situation. | ||||||||||
| Comment by Alexander Gorrod [ 09/Apr/15 ] | ||||||||||
|
Hi oleg@evergage.com, is this still an issue for you? Is it difficult for you to gather the system information I requested? If so, I can think about other things that might help us resolve the problem. Thanks, | ||||||||||
| Comment by Alexander Gorrod [ 24/Mar/15 ] | ||||||||||
|
I've attached a tool to this ticket, gdbmon.py, that will use gdb to collect stack trace samples at regular intervals; this should give us a clearer picture of what's going on inside mongod. Please be aware that using this tool will have some performance (and potentially even functional) impact on the mongod in question, but since it is not performing well as it is, this may be a risk you are willing to accept. If you are willing to do so, please do the following:
| ||||||||||
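As a rough illustration of the technique gdbmon.py relies on (periodic batch-mode gdb stack dumps of the mongod process), a minimal sketch is shown below. It is not the attached tool; the PID, interval, sample count, and output path are assumptions.

```python
# Minimal periodic stack sampler; NOT the attached gdbmon.py, just a sketch of
# the same idea. Each sample briefly pauses mongod while gdb attaches.
import subprocess
import sys
import time

def sample_stacks(pid, interval_secs=5, samples=60, out_path="mongod-stacks.txt"):
    """Append a full 'thread apply all bt' dump for pid every interval_secs."""
    with open(out_path, "a") as out:
        for _ in range(samples):
            out.write("=== sample at %s ===\n" % time.strftime("%Y-%m-%dT%H:%M:%S"))
            subprocess.call(
                ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
                stdout=out, stderr=subprocess.STDOUT)
            time.sleep(interval_secs)

if __name__ == "__main__":
    sample_stacks(int(sys.argv[1]))   # usage: python stack_sampler.py <mongod pid>
```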
| Comment by Oleg Rekutin [ 24/Mar/15 ] | ||||||||||
|
The oplog is 120000MB. The collections being written to have three indices each, including the _id index. I'm not sure what you mean by "tables". There were four heavy concurrent write workloads to four different databases (one collection in each), alongside a small amount of consistent, bursty insertions to 10-15 collections, and finally small sporadic writes to another ~25-100 collections or so. Each collection is in its own database in this situation. My theory is that at this increased throughput, the server is hitting starvation of some kind that's causing it to not write out the replication log at full speed. At the time of the problem, iostat -x 5 showed IO utilization on the node at 49%, 81%, and 23% over three successive 5-second intervals. So it looks like we weren't reaching the full IO limit. Also, analyzing CloudWatch monitoring stats, I see that the node was running at about 800-900 IOPS (about 30% of the total IOPS available to the underlying EBS SSD storage). Anyway, I would normally expect IO saturation to "flatten out" throughput and cause a smooth growth in replication lag. What I'm seeing here in the MMS charts is that when nodes get into trouble, they have "spiky" replication insert rates. I realized I made a mistake earlier: the underlying storage is EBS GP2 SSD volumes, NOT ephemeral SSDs. The guaranteed throughput for these volumes is 3000 IOPS, which supports up to 48 MB/s. | ||||||||||
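The throughput and utilization figures quoted above can be sanity-checked with simple arithmetic; the sketch below assumes the 16 KiB I/O size that gp2 uses to count IOPS and the 3000 IOPS ceiling quoted in the comment.

```python
# Back-of-envelope check of the EBS gp2 numbers above (assumption: 16 KiB per
# counted I/O and the 3000 IOPS ceiling quoted in the comment).
burst_iops = 3000
io_size_bytes = 16 * 1024
print("max throughput MB/s:", burst_iops * io_size_bytes / 1e6)     # ~49 MB/s

observed_iops = 900          # upper end of the 800-900 IOPS seen in CloudWatch
print("fraction of IOPS used:", observed_iops / float(burst_iops))  # ~0.3
```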
| Comment by Alexander Gorrod [ 24/Mar/15 ] | ||||||||||
|
Thanks for the additional information. I've been looking through the logs and I can see some oddities, but nothing that obviously uncovers the issue. I'll keep digging and let you know when I've got more information. A couple of questions:
Thanks, | ||||||||||
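For reference, the kind of figures given in the reply above (oplog size, per-collection index counts) can be pulled with a short pymongo sketch; the connection details and collection names below are placeholders, not values from this ticket.

```python
# Pull oplog data size and the index count for one collection (placeholder names;
# assumption: run directly against the secondary in question).
from pymongo import MongoClient

client = MongoClient("localhost", 27017)

oplog_stats = client.local.command("collStats", "oplog.rs")
print("oplog data size (MB):", oplog_stats["size"] / (1024.0 * 1024.0))

coll = client["someDatabase"]["someCollection"]       # placeholder names
print("index count:", len(coll.index_information()))  # includes the _id index
```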
| Comment by Oleg Rekutin [ 23/Mar/15 ] | ||||||||||
|
Alexander, you said: > Does this mean that you expect the shard that is exhibiting performance problems to handle greater load than the one that is operating as expected? i.e: Is the cause here that there is a greater load on that particular replica? I think that high load is related, but I doubt it's because it's relatively greater on that particular replica. Also, the problem replica has moved around depending on the day (so it's not constrained to happening only on one shard replica). I've seen this come up when the load was identical across shards, and there were shards that were OK despite the same load. > Are there other differences between the good/bad nodes? There are no differences in setup. They are running on identical AWS/EC2 hardware, provisioned identically using Chef. I haven't seen other behavior differences. > Can you send information about the replica that is exhibiting poor performance: Problem replica:
Good node:
It looks like there is no paging. I looked at the page fault chart (attached); the highest value was 0.4 for the problem node. The workload is almost entirely insertions. > What operating system are you running on? Amazon Linux 64-bit: 3.4.73-64.112.amzn1.x86_64 #1 SMP. > What is the disk subsystem like? (SSD, spinning disk, etc) Ephemeral SSD instance storage. | ||||||||||
| Comment by Oleg Rekutin [ 23/Mar/15 ] | ||||||||||
|
Attached are also MMS charts for this time period (the chart covers 12:28-12:44, matching the ss.log files). You can see in the first chart that the insertion loads are pretty similar across the shards. In the second chart, repl_insert_chart, you can see a very jagged pink line; that's the bad shard replica node. Its repl insert processing rate is all over the place. | ||||||||||
| Comment by Oleg Rekutin [ 23/Mar/15 ] | ||||||||||
|
Attached are about 15 minutes' worth of logs. ss-bad is from a node that was struggling and was not reducing its replication lag. shard-bad-repl-drain.log is the log from the server for that time period. You can see the slow serverStatus calls. | ||||||||||
| Comment by Alexander Gorrod [ 23/Mar/15 ] | ||||||||||
|
Could you please provide more information? You mention: > This "good shard" is experiencing ~7K insert operations vs ~9K inserts on the "problem shard." Does this mean that you expect the shard that is exhibiting performance problems to handle greater load than the one that is operating as expected? i.e: Is the cause here that there is a greater load on that particular replica? Are there other differences between the good/bad nodes? Can you send information about the replica that is exhibiting poor performance:
It would be very helpful if you can gather statistics over a period of time (for both a good and bad replica). To do that you can:
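One common way to capture such statistics is to sample db.serverStatus() once a second and append the output to a file on each node; a minimal pymongo sketch follows (host, port, duration, and file name are assumptions, not the exact steps requested here).

```python
# Sample serverStatus once a second and append it to a file for later comparison
# (assumption: pymongo is available; host/port/duration/file name chosen by the
# operator, not taken from this ticket).
import json
import time
from pymongo import MongoClient

def sample_server_status(host, port, out_path, seconds=900):
    client = MongoClient(host, port)
    with open(out_path, "a") as out:
        for _ in range(seconds):
            status = client.admin.command("serverStatus")
            out.write(json.dumps(status, default=str) + "\n")  # default=str handles BSON types
            time.sleep(1)

# e.g. sample_server_status("bad-replica.example.com", 27017, "ss-bad.log")
```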
Thanks, | ||||||||||
| Comment by Oleg Rekutin [ 20/Mar/15 ] | ||||||||||
|
ramon.fernandez, I don't have access to re-open this issue; should I reach out to someone out of band to re-open this? | ||||||||||
| Comment by Oleg Rekutin [ 20/Mar/15 ] | ||||||||||
|
Attached are the oplog stats for the first node with the problem and for a good node. The biggest difference I see is this, on the bad node:
Good node:
Notice that the bad node has an order of magnitude smaller "pages deleted" and "page reconciliation calls" numbers. I just checked the other bad node (the one that's not as bad); its numbers are about half of the good node's. Could it be that the oplog is not being consistently drained, or that the draining of it is somehow starved? | ||||||||||
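A comparison like the one described above can be automated; the sketch below (the file names and the choice of ranking counters by relative difference are assumptions) flattens two saved oplog.rs.stats() documents and prints the counters that differ most.

```python
# Flatten two saved oplog.rs.stats() JSON documents and rank numeric counters by
# relative difference (file names are assumptions, not attachments from this ticket).
import json

def flatten(doc, prefix=""):
    out = {}
    for key, value in doc.items():
        path = "%s.%s" % (prefix, key) if prefix else key
        if isinstance(value, dict):
            out.update(flatten(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            out[path] = value
    return out

bad = flatten(json.load(open("oplog_stats_bad.json")))
good = flatten(json.load(open("oplog_stats_good.json")))

diffs = sorted(
    (abs(bad[k] - good[k]) / max(abs(good[k]), 1.0), k) for k in set(bad) & set(good))
for ratio, key in reversed(diffs[-20:]):   # 20 largest relative differences
    print("%-60s bad=%-15s good=%s" % (key, bad[key], good[key]))
```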
| Comment by Oleg Rekutin [ 20/Mar/15 ] | ||||||||||
|
Comparing server statuses between the first bad node and the good node, I am seeing the following major differences (bad node vs good node): reconciliation."split bytes currently awaiting free": 1696932 vs 9896. The repl.buffer section is drastically different; bad node:
good node:
| ||||||||||
| Comment by Oleg Rekutin [ 20/Mar/15 ] | ||||||||||
|
Okay, it looks like I am seeing this problem come back. After a day of running on 3.0.1, I saw one of the nodes start building replication lag of 300-400 seconds, whereas other nodes peaked at maybe 10 seconds. Then another node encountered the problem, just not as severely. Attaching 3 outputs of db.serverStatus(): two from nodes with problems and one from a node without problems. | ||||||||||
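For tracking the lag numbers mentioned above, one option is to derive them from replSetGetStatus; a minimal pymongo sketch (host names are placeholders) follows.

```python
# Compute per-secondary replication lag from replSetGetStatus (placeholder host;
# assumption: direct access to a member of the problem shard's replica set).
from pymongo import MongoClient

def replication_lag_seconds(host="localhost", port=27017):
    status = MongoClient(host, port).admin.command("replSetGetStatus")
    primary = next(m for m in status["members"] if m["stateStr"] == "PRIMARY")
    return {m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
            for m in status["members"] if m["stateStr"] == "SECONDARY"}

# e.g. print(replication_lag_seconds("shard2-primary.example.com"))
```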
| Comment by Ramon Fernandez Marina [ 19/Mar/15 ] | ||||||||||
|
Thanks for the update, oleg@evergage.com. I'm resolving this ticket now, but please feel free to reopen and provide db.serverStatus() info if the issue happens again. Regards, | ||||||||||
| Comment by Oleg Rekutin [ 19/Mar/15 ] | ||||||||||
|
I just upgraded the cluster to 3.0.1 due to a critical bug, so I am now running a 15% heavier insert workload, and it is stable/coping so far. I will try to ramp up the load before declaring this fixed, or at least keep an eye on it. I will definitely capture db.serverStatus next time! Thank you. | ||||||||||
| Comment by Alexander Gorrod [ 19/Mar/15 ] | ||||||||||
|
Could you also upload the output of db.serverStatus() from the replica shard that is having trouble? | ||||||||||
| Comment by Oleg Rekutin [ 18/Mar/15 ] | ||||||||||
|
additional_oplog_stats.json is the second run of oplog stats from the problem node, run approx 7-10 minutes after the first one in this issue. | ||||||||||
| Comment by Oleg Rekutin [ 18/Mar/15 ] | ||||||||||
|
The second screenshot shows a replica shard in the system that does NOT have a problem and is showing a fairly consistent repl insert rate. This "good shard" is experiencing ~7K insert operations vs ~9K inserts on the "problem shard." | ||||||||||
| Comment by Oleg Rekutin [ 18/Mar/15 ] | ||||||||||
|
The first screenshot shows the problem replica node at the start of replication lag growth, with uneven repl insert operation counters. You can see that after the restart it kept up for a bit, but then fell behind again. |