[SERVER-17649] Secondary oplog replication inserts stall/slow under heavy insert load using WT Created: 18/Mar/15  Updated: 04/Jun/15  Resolved: 17/Apr/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.0.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Oleg Rekutin Assignee: Alexander Gorrod
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2015-03-18 at 2.38.22 PM.png     PNG File Screen Shot 2015-03-18 at 2.42.00 PM.png     File additional_oplog_stats.json     File gdbmon.py     PNG File insert_chart.png     File node2withproblem-serverStatus.json     File nodewithoutproblem-oplogstats.json     File nodewithoutproblem-serverStatus.json     File nodewithproblem-oplogstats.json     File nodewithproblem-serverStatus.json     PNG File repl_insert_chart.png     Text File shard-bad-repl-drain.log     File ss-bad.log.gz     File ss-good.log.gz    
Issue Links:
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

Description

We noticed that one of our 4 shards had a secondary with high replication lag.

The replica shard's log shows periodic, very slow serverStatus calls with large values for "after oplog":

2015-03-18T18:30:44.539+0000 I COMMAND [conn82] serverStatus was very slow:

{ after basic: 0, after asserts: 0, after connections: 0, after cursors: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opcounters: 0, after opcountersRepl: 0, after oplog: 72189, after repl: 72189, after storageEngine: 72189, after wiredTiger: 72189, at end: 72189 }

We restarted the replica shard hoping this was a fluke, but it quickly went back to a low insertion rate.
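
To see whether serverStatus latency tracks these stalls over time, a simple check like the following can be run against the affected secondary (a rough sketch using the mongo shell, not the server's own slow-command instrumentation):

    mongo --eval 'while (true) { var t = Date.now(); db.serverStatus(); print(new Date().toISOString() + " serverStatus took " + (Date.now() - t) + " ms"); sleep(1000); }'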

Running db.getSiblingDB("local").oplog.rs.stats() produces:

{
	"ns" : "local.oplog.rs",
	"count" : 174896077,
	"size" : 125833992359,
	"avgObjSize" : 719,
	"storageSize" : 62140203008,
	"capped" : true,
	"max" : -1,
	"maxSize" : NumberLong("125829120000"),
	"wiredTiger" : {
		"metadata" : {
			"formatVersion" : 1,
			"oplogKeyExtractionVersion" : 1
		},
		"creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=0,checkpoint=(WiredTigerCheckpoint.23998=(addr=\"01e30ad2de81e436162abee3ddf13ebde46538e310e3ddf17bb3e43c3b848d808080e50e77d81fc0e50bad737fc0\",order=23998,time=1426703669,size=50154676224,write_gen=20018662)),checkpoint_lsn=(7514,101134976),checksum=uncompressed,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=9,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=1MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_deepen_min_child=0,split_deepen_per_child=0,split_pct=90,value_format=u,version=(major=1,minor=1)",
		"type" : "file",
		"uri" : "statistics:table:local/collection-2--4790446482890547801",
		"LSM" : {
			"bloom filters in the LSM tree" : 0,
			"bloom filter false positives" : 0,
			"bloom filter hits" : 0,
			"bloom filter misses" : 0,
			"bloom filter pages evicted from cache" : 0,
			"bloom filter pages read into cache" : 0,
			"total size of bloom filters" : 0,
			"sleep for LSM checkpoint throttle" : 0,
			"chunks in the LSM tree" : 0,
			"highest merge generation in the LSM tree" : 0,
			"queries that could have benefited from a Bloom filter that did not exist" : 0,
			"sleep for LSM merge throttle" : 0
		},
		"block-manager" : {
			"file allocation unit size" : 4096,
			"blocks allocated" : 176633,
			"checkpoint size" : 50154676224,
			"allocations requiring file extension" : 0,
			"blocks freed" : 206417,
			"file magic number" : 120897,
			"file major version number" : 1,
			"minor version number" : 0,
			"file bytes available for reuse" : 12083351552,
			"file size in bytes" : 62140203008
		},
		"btree" : {
			"column-store variable-size deleted values" : 0,
			"column-store fixed-size leaf pages" : 0,
			"column-store internal pages" : 0,
			"column-store variable-size leaf pages" : 0,
			"pages rewritten by compaction" : 0,
			"number of key/value pairs" : 0,
			"fixed-record size" : 0,
			"maximum tree depth" : 32,
			"maximum internal page key size" : 368,
			"maximum internal page size" : 4096,
			"maximum leaf page key size" : 3276,
			"maximum leaf page size" : 32768,
			"maximum leaf page value size" : 1048576,
			"overflow pages" : 0,
			"row-store internal pages" : 0,
			"row-store leaf pages" : 0
		},
		"cache" : {
			"bytes read into cache" : 5684148385,
			"bytes written from cache" : 5626150236,
			"checkpoint blocked page eviction" : 11413,
			"unmodified pages evicted" : 175446,
			"page split during eviction deepened the tree" : 1,
			"modified pages evicted" : 3911,
			"data source pages selected for eviction unable to be evicted" : 2675,
			"hazard pointer blocked page eviction" : 589,
			"internal pages evicted" : 1882,
			"pages split during eviction" : 585,
			"in-memory page splits" : 553,
			"overflow values cached in memory" : 0,
			"pages read into cache" : 178344,
			"overflow pages read into cache" : 0,
			"pages written from cache" : 176607
		},
		"compression" : {
			"raw compression call failed, no additional data available" : 0,
			"raw compression call failed, additional data available" : 0,
			"raw compression call succeeded" : 0,
			"compressed pages read" : 176000,
			"compressed pages written" : 173316,
			"page written failed to compress" : 3,
			"page written was too small to compress" : 3288
		},
		"cursor" : {
			"create calls" : 128,
			"insert calls" : 10935859,
			"bulk-loaded cursor-insert calls" : 0,
			"cursor-insert key and value bytes inserted" : 5679356593,
			"next calls" : 7327404,
			"prev calls" : 35,
			"remove calls" : 561,
			"cursor-remove key bytes removed" : 5049,
			"reset calls" : 10937358,
			"search calls" : 0,
			"search near calls" : 1122,
			"update calls" : 0,
			"cursor-update value bytes updated" : 0
		},
		"reconciliation" : {
			"dictionary matches" : 0,
			"internal page multi-block writes" : 25,
			"leaf page multi-block writes" : 605,
			"maximum blocks required for a page" : 236,
			"internal-page overflow keys" : 0,
			"leaf-page overflow keys" : 0,
			"overflow values written" : 0,
			"pages deleted" : 2780,
			"page checksum matches" : 5294,
			"page reconciliation calls" : 6475,
			"page reconciliation calls for eviction" : 1885,
			"leaf page key bytes discarded using prefix compression" : 0,
			"internal page key bytes discarded using suffix compression" : 175625
		},
		"session" : {
			"object compaction" : 0,
			"open cursor count" : 128
		},
		"transaction" : {
			"update conflicts" : 0
		}
	},
	"nindexes" : 0,
	"totalIndexSize" : 0,
	"indexSizes" : {
		
	},
	"ok" : 1
}
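
For reference, the configured cap implied by the maxSize value above works out as follows (a quick back-of-the-envelope check; maxSize for a capped collection is reported in bytes):

    echo "125829120000 / (1024 * 1024)" | bc       # 120000 MB configured oplog cap
    echo "scale=1; 125829120000 / 10^9" | bc       # ~125.8 GB in decimal units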



Comments
Comment by Alexander Gorrod [ 10/Apr/15 ]

Thanks for the update, and for being so helpful in providing information - I'll chase it some more based on the information you already provided, but will close it if I can't find anything obvious. We've got a couple of other cases that could be related that either have fixes or reproducible workloads.

Comment by Oleg Rekutin [ 09/Apr/15 ]

Alexander Gorrod, this has stopped being an issue for us at the moment, as our need to run the heavy insert loads has abated (it was a one-time need; the cluster experiences moderate load normally). Unfortunately, I have not had a chance to replicate this scenario in a test environment where I could send similar load, and absent the original load pressure, I am not seeing this issue in day-to-day operation. I apologize that I cannot get you the extra system information at this time. If you have ideas for other things I can check for when this happens again, let me know.

I hate to leave issues hanging like this--if it helps, we can close right now, and I can reopen once there is an opportunity for me to instrument/collect stats for this situation.

Comment by Alexander Gorrod [ 09/Apr/15 ]

Hi oleg@evergage.com, is this still an issue for you? Is it difficult for you to gather the system information I requested? If so I can think about other things that might help us resolve the problem.

Thanks,
Alex

Comment by Alexander Gorrod [ 24/Mar/15 ]

I've attached a tool to this ticket, gdbmon.py, that will use gdb to collect stack trace samples at regular intervals; this should give us a clearer picture of what's going on inside mongod. Please be aware that using this tool will have some performance (and potentially even functional) impact on the mongod in question; but since it is not performing well as it is, this may be a risk you are willing to accept. If you are willing to do so, please do the following:

  • install gdb on the node with the issue
  • while the secondary is in the low-performing state run the following two data collection scripts in parallel:

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log &
    python gdbmon.py $(pidof mongod) 5 >gdbmon.log &
    

    This will collect serverStatus time series as before, and in addition stack trace samples at 5-second intervals.

  • stop the commands after about 10 minutes
  • then upload ss.log and gdbmon.log for us to review.
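
For reference, a minimal sketch of the kind of sampling such a tool performs (the attached gdbmon.py is the authoritative script; this is only an illustrative equivalent built on gdb's batch mode):

    PID=$(pidof mongod)
    while true; do
        date
        # dump a backtrace of every mongod thread, then detach
        gdb -p "$PID" -batch -ex "thread apply all bt" 2>/dev/null
        sleep 5
    done > gdbmon-sketch.log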

Comment by Oleg Rekutin [ 24/Mar/15 ]

The oplog is 120000MB. The collections being written to have three indices each, including the _id index. I'm not sure what you mean by "tables". There were four heavy concurrent write workloads against four different databases (one collection in each), alongside a small amount of consistent, bursty insertions to 10-15 collections, and finally small sporadic writes to another ~25-100 collections or so. Each collection is in its own database in this situation.

My theory is that at this increased throughput, the server is hitting starvation of some kind that is preventing it from writing out the replication log at full speed. At the time of the problem, iostat -x 5 showed IO utilization on the node at 49%, 81%, and 23% over three successive 5-second intervals. So it looks like we weren't reaching the full IO limit. Also, analyzing CloudWatch monitoring stats, I see that the node was running at about 800-900 IOPS (about 30% of the total IOPS available to the underlying EBS SSD storage).

Anyway, I would normally expect IO saturation to "flatten out" throughput and cause a smooth growth in replication lag. What I'm seeing in the MMS charts instead is that when nodes get into trouble, they have "spiky" replication insert rates.

I realized I made a mistake earlier: the underlying storage is EBS GP2 SSD volumes, NOT ephemeral SSDs. For these volumes, the guaranteed throughput is 3000 IOPS, which supports up to 48Mbps.
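
For reference, those figures are roughly self-consistent (assuming gp2 IOPS are measured against 16 KiB I/Os, per AWS's gp2 definition, and that "48Mbps" above refers to roughly 48 MB/s of throughput):

    echo "scale=2; 3000 * 16 / 1024" | bc      # 46.87 MiB/s, i.e. roughly the 48 MB/s ceiling quoted above
    echo "850 * 100 / 3000" | bc               # observed 800-900 IOPS is ~28% of the 3000 IOPS baseline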

Comment by Alexander Gorrod [ 24/Mar/15 ]

Thanks for the additional information. I've been looking through the logs and I can see some oddities, but nothing that obviously uncovers the issue. I'll keep digging and let you know when I've got more information.

A couple of questions:

  • How big is the oplog configured to be? It looks to me like it's 128GB - is that right?
  • How many tables and indexes do you have in your collection?

Thanks,
Alex

Comment by Oleg Rekutin [ 23/Mar/15 ]

Alexander, you said:

> Does this mean that you expect the shard that is exhibiting performance problems to handle greater load than the one that is operating as expected? i.e: Is the cause here that there is a greater load on that particular replica?

I think that high load is related, but I doubt it's because the load is relatively greater on that particular replica. Also, the problem replica has moved around depending on the day (so it's not constrained to happening only on one shard replica). I've seen this come up when the load was identical across shards, with some shards staying OK despite the same load.

> Are there other differences between the good/bad nodes?

There are no differences in setup. They are running on identical AWS/EC2 hardware, provisioned identically using Chef. I haven't seen other behavior differences.

>Can you send information about the replica that is exhibiting poor performance:
> How much RAM is in-use on the problem replica? Is it possible that the machine is paging?

Problem replica:

$ free -m
             total       used       free     shared    buffers     cached
Mem:         29909      29110        799          0          6       3789
-/+ buffers/cache:      25314       4595
Swap:            0          0          0

Good node:

free -m
             total       used       free     shared    buffers     cached
Mem:         29909      29703        206          0          9       5269
-/+ buffers/cache:      24423       5486
Swap:            0          0          0

It looks like there is no paging. I looked at the page fault chart (attached); the highest value was 0.4 for the problem node. The workload is almost entirely insertions.
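
For completeness, swap activity can also be confirmed directly with something like:

    vmstat 5 3    # the si/so columns show swap-in/swap-out per second; non-zero values would indicate paging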

> What operating system are you running on?

Amazon Linux 64-bit: 3.4.73-64.112.amzn1.x86_64 #1 SMP.

> What is the disk subsystem like? (SSD, spinning disk, etc)

Ephemeral SSD instance storage.

Comment by Oleg Rekutin [ 23/Mar/15 ]

Also attached are MMS charts for this time period (the charts cover 12:28-12:44, matching the ss.log files). You can see in the first chart that the insertion loads are pretty similar across the shards.

In the second chart, repl_insert_chart, you can see a very jagged pink line; that's the bad shard's replica node. Its repl insert processing rate is all over the place.

Comment by Oleg Rekutin [ 23/Mar/15 ]

Attached are about 15 minutes' worth of logs. ss-bad is from a node that was struggling and was not reducing its replication lag. shard-bad-repl-drain.log is the server log for that time period. You can see the slow serverStatus calls.

Comment by Alexander Gorrod [ 23/Mar/15 ]

Could you please provide more information. You mention:

> This "good shard" is experiencing ~7K insert operations vs ~9K inserts on the "problem shard."

Does this mean that you expect the shard that is exhibiting performance problems to handle greater load than the one that is operating as expected? i.e: Is the cause here that there is a greater load on that particular replica?

Are there other differences between the good/bad nodes? Can you send information about the replica that is exhibiting poor performance:

  • How much RAM is in-use on the problem replica? Is it possible that the machine is paging?
  • What operating system are you running on?
  • What is the disk subsystem like? (SSD, spinning disk, etc)
  • Attach the server log from the replica that is slow.

It would be very helpful if you can gather statistics over a period of time (for both a good and bad replica). To do that you can:

  • start mongod as usual
  • start the following data collection script:

    mongo --eval "while(true) {print(JSON.stringify(db.serverStatus({tcmalloc:1}))); sleep(1000)}" >ss.log
    

  • Stop the above command after about 10 minutes.
  • Upload ss.log for us to review.

Thanks,
Alex

Comment by Oleg Rekutin [ 20/Mar/15 ]

ramon.fernandez, I don't have access to re-open this issue; should I reach out to someone out of band to re-open it?

Comment by Oleg Rekutin [ 20/Mar/15 ]

Attached are the oplog stats for first node with the problem and a good node.

The biggest difference I see is this, on a bad node:

		"reconciliation" : {
			"pages deleted" : 182429,
			"page reconciliation calls" : 1388768,
			"page reconciliation calls for eviction" : 1221002,

Good node:

		"reconciliation" : {
			"pages deleted" : 1508767,
			"page reconciliation calls" : 10211897,
			"page reconciliation calls for eviction" : 9789649,

Notice that the bad node's "pages deleted" and "page reconciliation calls" numbers are an order of magnitude smaller. I just checked the other bad node (which is not as badly affected); its numbers are about half those of the good node.

Could it be that the oplog is not being drained consistently, or that the draining is being starved somehow?
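
A quick way to pull just those counters on each node for a side-by-side comparison (a sketch built on the same stats() call used above):

    mongo --quiet --eval '
        var r = db.getSiblingDB("local").oplog.rs.stats().wiredTiger.reconciliation;
        printjson({
            "pages deleted": r["pages deleted"],
            "page reconciliation calls": r["page reconciliation calls"],
            "page reconciliation calls for eviction": r["page reconciliation calls for eviction"]
        });'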

Comment by Oleg Rekutin [ 20/Mar/15 ]

Comparing server statuses between the first bad node and the good node, I see the following major differences (bad node vs. good node):

reconciliation."split bytes currently awaiting free" : 1696932 vs 9896
concurrentTransactions.write.out is 13 vs 0

repl.buffer is drastically different, bad node:

			"buffer" : {
				"count" : NumberLong(598480),
				"maxSizeBytes" : 268435456,
				"sizeBytes" : NumberLong(266839253)
			},

good node:

			"buffer" : {
				"count" : NumberLong(0),
				"maxSizeBytes" : 268435456,
				"sizeBytes" : NumberLong(0)
			},
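
To spot-check these fields on each node without diffing full serverStatus output, something like this can be run (a sketch; field paths are as reported by serverStatus on 3.0 with WiredTiger):

    mongo --quiet --eval '
        var s = db.serverStatus();
        printjson({
            replBuffer: s.repl.buffer,
            writeTicketsOut: s.wiredTiger.concurrentTransactions.write.out
        });'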

Comment by Oleg Rekutin [ 20/Mar/15 ]

Okay, looks like I am seeing this problem come back. After a day of running on 3.0.1, I saw one of the nodes start building replication lag to 300-400 seconds whereas other nodes peaked at maybe 10 seconds. Then another node encountered the problem, just not as severely.

Attaching 3 outputs of db.serverStatus(): two from nodes with problems and one from a node without problems.

Comment by Ramon Fernandez Marina [ 19/Mar/15 ]

Thanks for the update oleg@evergage.com. I'm resolving this ticket now, but please feel free to reopen and provide db.serverStatus() info if the issue happens again.

Regards,
Ramón.

Comment by Oleg Rekutin [ 19/Mar/15 ]

I just upgraded the cluster to 3.0.1 due to a critical bug, so I am now running a 15% heavier insert workload, and it is stable and coping so far. I will try to ramp up the load before declaring this fixed, and will keep an eye on it in the meantime.

I will definitely capture db.serverStatus next time! Thank you.

Comment by Alexander Gorrod [ 19/Mar/15 ]

Could you also upload the output from db.serverStatus from the replica shard that is having trouble?

Comment by Oleg Rekutin [ 18/Mar/15 ]

additional_oplog_stats.json is the second run of oplog stats from the problem node, run approx 7-10 minutes after the first one in this issue.

Comment by Oleg Rekutin [ 18/Mar/15 ]

The second screenshot shows a replica shard in the system that does NOT have the problem and is showing a fairly consistent replica insert rate.

This "good shard" is experiencing ~7K insert operations vs ~9K inserts on the "problem shard."

Comment by Oleg Rekutin [ 18/Mar/15 ]

The first screenshot shows the problem replica node, with the start of replication lag growth and uneven repl insert operation counters. You can see that after the restart it kept up for a bit, but then fell behind again.
