[SERVER-16921] WT oplog bottleneck on secondary Created: 17/Jan/15  Updated: 18/Sep/15  Resolved: 26/Jan/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 2.8.0-rc5
Fix Version/s: 3.0.0-rc7

Type: Bug Priority: Major - P3
Reporter: Cailin Nelson Assignee: Eliot Horowitz (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File capped_as_neeed.png     File capped_truncate.js     File capped_truncate_parallel.js     PNG File mms-db-1-repl-progress.png     PNG File mms-db-1.ny1-small-oplog.png     PNG File mms-db-1.ny1.10gen.cc-snapshot.png     PNG File mmsconfig1-2015-01-18.png     PNG File mmsconfig1.png     PNG File pri.png     PNG File sec.png    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-17046 FASSERT 18750 in rsBackgroundSync Closed
is duplicated by SERVER-17055 secondary cannot keep up once oplog h... Closed
is duplicated by SERVER-17074 Sharded Replicaset - replicas fall be... Closed
Related
related to SERVER-16868 Significant slowdown in capped collec... Closed
related to SERVER-17083 Background thread for oplog clean-up ... Closed
related to SERVER-17094 repairing local db with WiredTiger wi... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

We converted one secondary to 2.8.0-rc5 at approximately 16:00 UTC on 2015-01-16. The throughput observed during the "oplog catch-up" phase of the initial sync was truly impressive!

The secondary continued to perform as expected until approximately 11:00 UTC on 2015-01-17. At that point its throughput dropped enough that it could no longer keep up with the 2.6.5 primary. The replication lag is increasing steadily, but linearly: the secondary's throughput dropped to some fraction X% of the primary load, and that X% has held steady since 11:00 rather than continuing to decrease.

There is nothing obvious in the logs (will attach anyway) other than a gradual increase in the frequency of "serverStatus was very slow" messages. serverStatus appears to be slow in the oplog phase. (Presumably this is a symptom, not the cause, but mentioning it anyway.)

      0       0       0
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T04" | wc
      0       0       0
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T05" | wc
      0       0       0
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T06" | wc
      0       0       0
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T07" | wc
      1      61     421
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T08" | wc
      3     183    1266
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T09" | wc
     12     732    5076
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T10" | wc
     53    3233   22402
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T11" | wc
    147    8967   62167
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T12" | wc
    154    9394   65086
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T13" | wc
    151    9211   63832
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T14" | wc
    142    8662   60027
[cnelson@mms-db-1.ny1 /var/log/mongod-mmsconfig1-prod]$ grep "was very slow" mongod.log | grep "2015-01-17T15" | wc
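
A compact way to tabulate those counts hour by hour, using the same log path as in the prompts above (a sketch; adjust the date and hour list as needed):

    for h in 03 04 05 06 07 08 09 10 11 12 13 14 15; do
        # hour label, then the count of "was very slow" lines logged in that hour
        echo -n "2015-01-17T$h: "
        grep "was very slow" /var/log/mongod-mmsconfig1-prod/mongod.log | grep -c "2015-01-17T$h"
    done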



 Comments   
Comment by Githook User [ 26/Jan/15 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: SERVER-16921: background thread for oplog cleanup

(cherry picked from commit 1f394a5db22c782e88f24db8c0d5267f89f0efc8)
Branch: v3.0
https://github.com/mongodb/mongo/commit/0c3f6afee5c9246918584b1e2d93c03ebb04c6f1

Comment by Githook User [ 26/Jan/15 ]

Author: Eliot Horowitz (erh) <eliot@10gen.com>

Message: SERVER-16921: background thread for oplog cleanup
Branch: master
https://github.com/mongodb/mongo/commit/1f394a5db22c782e88f24db8c0d5267f89f0efc8

Comment by Daniel Pasette (Inactive) [ 26/Jan/15 ]

With the attached workload (capped_truncate_parallel.js), I'm seeing storageSize continue to grow slowly.

Run this to monitor storageSize growth:

dan:(git)mongo[master]/$ while sleep 2; do mongo --quiet --eval "var s = db.capped.stats(1024*1024); print('storageSize: ' + s.storageSize + ', size: ' + s.size)"; done

Comment by Daniel Pasette (Inactive) [ 25/Jan/15 ]
  • Here's the busiest call stack when the server starts grinding to a halt:
Comment by Daniel Pasette (Inactive) [ 25/Jan/15 ]

Open questions, with stats taken from before the restart (a shell sketch for pulling these counters follows the oplog stats dump below):

  • Why are there so many open sessions/cursors? This is a hidden secondary only applying ops from the oplog:

                "session": {
                    "open session count": 137,
                    "open cursor count": 9863
                },

  • Why are so many transactions rolled back?

                    "transaction begins": 23572391,
                    "transactions rolled back": 5077844,
                    "transactions committed": 18569450
                    "transaction range of IDs currently pinned": 0,

  • Keith mentioned he thought that this stat in db.serverStatus().wiredTiger.cache was a bit high considering how long the server had been up.

    "pages evicted because they had chains of deleted items": 103566

    This seems to be caused by deleting from the oplog once it wraps around (which this one has). Using gdb sampling, I was able to see that the majority of the server's time is spent in __wt_delete_page_skip. Keith says:

    Basically: when a cursor is wandering through a tree and finds a "deleted" page, it does this little dance to ask "do I need to see that page?" Normally it shouldn't need to see it, because the transaction is from long ago. That check effectively single-threads.

  • Here are the collection stats for the oplog. Unfortunately, this was taken after the server was restarted, so it's not a perfect snapshot. The btree depth seems deeper than I would think for a collection this size. Could this be something?

    mmsconfig1:SECONDARY> db.oplog.rs.stats();
    {
            "ns" : "local.oplog.rs",
            "count" : 8759339,
            "size" : 5242879651,
            "avgObjSize" : 598,
            "storageSize" : 4440723456,
            "capped" : true,
            "max" : -1,
            "maxSize" : NumberLong("5242880000"),
            "wiredTiger" : {
                    "metadata" : {
                            "formatVersion" : 1,
                            "oplogKeyExtractionVersion" : 1
                    },
                    "creationString" : "allocation_size=4KB,app_metadata=(formatVersion=1,oplogKeyExtractionVersion=1),block_allocation=best,block_compressor=snappy,cache_resident=0,checkpoint=(WiredTigerCheckpoint.5335=(addr=\"01e301145781e455f619b7e3013f8b86e48f011e10e3013f9185e4270e6d06808080e50108afefc0e4ebc15fc0\",order=5335,time=1422150594,size=3955326976,write_gen=14069223)),checkpoint_lsn=(6551,70419712),checksum=uncompressed,collator=,columns=,dictionary=0,format=btree,huffman_key=,huffman_value=,id=10,internal_item_max=0,internal_key_max=0,internal_key_truncate=,internal_page_max=4KB,key_format=q,key_gap=10,leaf_item_max=0,leaf_key_max=0,leaf_page_max=32KB,leaf_value_max=1MB,memory_page_max=10m,os_cache_dirty_max=0,os_cache_max=0,prefix_compression=0,prefix_compression_min=4,split_pct=90,value_format=u,version=(major=1,minor=1)",
                    "type" : "file",
                    "uri" : "statistics:table:collection-8-443056350544942853",
                    "LSM" : {
                            "bloom filters in the LSM tree" : 0,
                            "bloom filter false positives" : 0,
                            "bloom filter hits" : 0,
                            "bloom filter misses" : 0,
                            "bloom filter pages evicted from cache" : 0,
                            "bloom filter pages read into cache" : 0,
                            "total size of bloom filters" : 0,
                            "sleep for LSM checkpoint throttle" : 0,
                            "chunks in the LSM tree" : 0,
                            "highest merge generation in the LSM tree" : 0,
                            "queries that could have benefited from a Bloom filter that did not exist" : 0,
                            "sleep for LSM merge throttle" : 0
                    },
                    "block-manager" : {
                            "file allocation unit size" : 4096,
                            "blocks allocated" : 112510,
                            "checkpoint size" : 3955326976,
                            "allocations requiring file extension" : 0,
                            "blocks freed" : 112742,
                            "file magic number" : 120897,
                            "file major version number" : 1,
                            "minor version number" : 0,
                            "file bytes available for reuse" : 507191296,
                            "file size in bytes" : 4440723456
                    },
                    "btree" : {
                            "column-store variable-size deleted values" : 0,
                            "column-store fixed-size leaf pages" : 0,
                            "column-store internal pages" : 0,
                            "column-store variable-size leaf pages" : 0,
                            "pages rewritten by compaction" : 0,
                            "number of key/value pairs" : 0,
                            "fixed-record size" : 0,
                            "maximum tree depth" : 8,
                            "maximum internal page key size" : 368,
                            "maximum internal page size" : 4096,
                            "maximum leaf page key size" : 3276,
                            "maximum leaf page size" : 32768,
                            "maximum leaf page value size" : 1048576,
                            "overflow pages" : 0,
                            "row-store internal pages" : 0,
                            "row-store leaf pages" : 0
                    },
                    "cache" : {
                            "bytes read into cache" : 3497984597,
                            "bytes written from cache" : 3507880275,
                            "checkpoint blocked page eviction" : 0,
                            "unmodified pages evicted" : 0,
                            "modified pages evicted" : 112023,
                            "data source pages selected for eviction unable to be evicted" : 363,
                            "hazard pointer blocked page eviction" : 326,
                            "internal pages evicted" : 0,
                            "in-memory page splits" : 317,
                            "overflow values cached in memory" : 0,
                            "pages read into cache" : 111537,
                            "overflow pages read into cache" : 0,
                            "pages written from cache" : 112414
                    },
                    "compression" : {
                            "raw compression call failed, no additional data available" : 0,
                            "raw compression call failed, additional data available" : 0,
                            "raw compression call succeeded" : 0,
                            "compressed pages read" : 107103,
                            "compressed pages written" : 107395,
                            "page written failed to compress" : 3221,
                            "page written was too small to compress" : 1798
                    },
                    "cursor" : {
                            "create calls" : 256,
                            "insert calls" : 5716011,
                            "bulk-loaded cursor-insert calls" : 0,
                            "cursor-insert key and value bytes inserted" : 3467882588,
                            "next calls" : 11472040,
                            "prev calls" : 334,
                            "remove calls" : 2457666,
                            "cursor-remove key bytes removed" : 22118994,
                            "reset calls" : 10636450,
                            "search calls" : 4915332,
                            "search near calls" : 4915332,
                            "update calls" : 0,
                            "cursor-update value bytes updated" : 0
                    },
                    "reconciliation" : {
                            "dictionary matches" : 0,
                            "internal page multi-block writes" : 82,
                            "leaf page multi-block writes" : 473,
                            "maximum blocks required for a page" : 294,
                            "internal-page overflow keys" : 0,
                            "leaf-page overflow keys" : 0,
                            "overflow values written" : 0,
                            "pages deleted" : 115764,
                            "page checksum matches" : 20879,
                            "page reconciliation calls" : 119005,
                            "page reconciliation calls for eviction" : 111907,
                            "leaf page key bytes discarded using prefix compression" : 0,
                            "internal page key bytes discarded using suffix compression" : 122724
                    },
                    "session" : {
                            "object compaction" : 0,
                            "open cursor count" : 1702
                    },
                    "transaction" : {
                            "update conflicts" : 0
                    }
            },
            "nindexes" : 0,
            "totalIndexSize" : 0,
            "indexSizes" : {
                   
            },
            "ok" : 1
    }
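
To pull just the counters referenced in the open questions above, without dumping the full serverStatus document, something like this works in the mongo shell (a minimal sketch; the field names are taken from the excerpts above):

    // Extract just the session, transaction, and cache counters discussed above
    var wt = db.serverStatus().wiredTiger;
    print("open session count:       " + wt.session["open session count"]);
    print("open cursor count:        " + wt.session["open cursor count"]);
    print("transaction begins:       " + wt.transaction["transaction begins"]);
    print("transactions rolled back: " + wt.transaction["transactions rolled back"]);
    print("transactions committed:   " + wt.transaction["transactions committed"]);
    print("pages evicted (deleted-item chains): " +
          wt.cache["pages evicted because they had chains of deleted items"]);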

Comment by Daniel Pasette (Inactive) [ 25/Jan/15 ]

It seems we're getting batches of updates every 20-25 seconds.

Comment by Cailin Nelson [ 18/Jan/15 ]

Gathering up some oplog stats....

The oplog window is still about 11.4 hours:

"oplog": {
            "latestOptime": {"$timestamp": {
                "t": 1421587377,
                "i": 187
            }},
            "earliestOptime": {"$timestamp": {
                "t": 1421546462,
                "i": 893
            }}
        },
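
The 11.4-hour figure follows directly from the two optimes above:

    // Oplog window = latestOptime.t - earliestOptime.t, in hours
    var windowSecs = 1421587377 - 1421546462;   // 40915 seconds
    print(windowSecs / 3600);                   // ~11.4 hours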

We don't specify an oplog size in our config, so MongoDB chose a slightly different size this time: it is now 30.6 GB.

mmsconfig1:SECONDARY> new Date()
ISODate("2015-01-18T14:29:17.461Z")
mmsconfig1:SECONDARY> db.getSiblingDB("local").oplog.rs.stats()
{
	"ns" : "local.oplog.rs",
	"count" : 52031335,
	"size" : 30585083654,
	"avgObjSize" : 587,
	"storageSize" : 22604181504,
	"capped" : true,
	"max" : -1,
	"maxSize" : NumberLong("30585075712"),

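The 30.6 GB figure is just maxSize from the stats above expressed in decimal gigabytes:

    // maxSize after the resync vs. before, in decimal GB
    print(30585075712 / 1e9);   // ~30.59 GB now
    print(30708911360 / 1e9);   // ~30.71 GB before the resync (see the 17/Jan stats below)
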
Comment by Cailin Nelson [ 18/Jan/15 ]

We did a fresh resync of the wiredTiger node last night, starting at 2015-01-18 01:13 UTC. I did the resync the old-fashioned way, i.e. shut down the process, rm -rf the data directory, and restart. My expectation would therefore be that we were starting from exactly the same point as on our first sync.

There was, however, one difference. On the first sync I had set cacheSizeGB=10, because I miscounted the digits in top and thought the server had 39GB of RAM. John pointed out that it actually has 390GB of RAM, so a couple of hours later we changed to cacheSizeGB=50. That change produced no observable difference in behavior, and we retained the cacheSizeGB=50 setting for the second sync. Other than that, conditions should have been identical.

The outcome was quite different, though: in this case the WT secondary fell behind much more quickly. The various memory stats in MMS also looked "odd" to me; please see the annotated MMS screenshot below. I will upload logs.

Comment by Cailin Nelson [ 17/Jan/15 ]

Ah, got it. Here are a couple of data points:

mmsconfig1:SECONDARY> new Date();
ISODate("2015-01-17T22:00:49.239Z")
mmsconfig1:SECONDARY>  db.getSiblingDB("local").oplog.rs.stats()
{
	"ns" : "local.oplog.rs",
	"count" : 52830580,
	"size" : 30708937877,
	"avgObjSize" : 581,
	"storageSize" : 23040737280,
	"capped" : true,
	"max" : -1,
	"maxSize" : NumberLong("30708911360"),
....

mmsconfig1:SECONDARY> new Date()
ISODate("2015-01-17T22:21:47.075Z")
mmsconfig1:SECONDARY> db.getSiblingDB("local").oplog.rs.stats()
{
	"ns" : "local.oplog.rs",
	"count" : 52826075,
	"size" : 30708926796,
	"avgObjSize" : 581,
	"storageSize" : 23056039936,
	"capped" : true,
	"max" : -1,
	"maxSize" : NumberLong("30708911360"),
	"wiredTiger" : {
...

Comment by Scott Hernandez (Inactive) [ 17/Jan/15 ]

Due to how capped collections work in WiredTiger at the moment, it is possible that the oplog is growing (and the cap is being pushed higher than originally allocated), not just filling up. Here is an example:

> db.getSiblingDB("local").oplog.rs.stats()
{
	"ns" : "local.oplog.rs",
	"count" : 1,
	"size" : 86,
	"storageSize" : 4096,
	"capped" : true,
	"max" : -1,
	"maxSize" : 1048576,
...
// lots of big writes
// New stats
{
	"ns" : "local.oplog.rs",
	"count" : 66,
	"size" : 66006138,
	"storageSize" : 9502720,
	"capped" : true,
	"max" : -1,
	"maxSize" : 1048576,
	"wiredTiger" : {
		"metadata" : {
...
{
	"ns" : "local.oplog.rs",
	"count" : 27,
	"size" : 27002511,
	"avgObjSize" : 1000093,
	"capped" : true,
	"max" : -1,
	"maxSize" : 1048576,
	"wiredTiger" : {
 
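A quick way to check for this condition on a live node (a minimal sketch, using the same stats fields shown above) is to compare size and storageSize against maxSize:

    // Compare the logical size and on-disk footprint against the configured cap;
    // size or storageSize well above maxSize suggests the oplog is growing rather than wrapping
    var s = db.getSiblingDB("local").oplog.rs.stats();
    printjson({ size: s.size, storageSize: s.storageSize, maxSize: s.maxSize });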

Comment by Cailin Nelson [ 17/Jan/15 ]

No, the oplog is not growing anymore. It reached its maximum size about 11 hours before the performance degradation began (see the earlier comment). SERVER-16868 sounds like a fit, although the degradation took a while to appear after the oplog cap was reached.

Comment by Scott Hernandez (Inactive) [ 17/Jan/15 ]

I suspect you are seeing SERVER-16868. Is the oplog growing?

Comment by Cailin Nelson [ 17/Jan/15 ]

Some possibly significant timestamps here:

  • Initial sync started: 2015-01-16 14:52
  • Initial sync completed: 2015-01-16 17:13 (ish)
  • Size of oplog: 11.4 hours (ish), 30.7 GB
  • Time of first performance issue: 2015-01-17 11:00 (ish)

So the performance issues started close to the time when the system began recycling the allocated oplog space for the second time: the oplog first reached its cap around 00:00 on 2015-01-17 (about 11 hours before the slowdown), and with an 11.4-hour window one full pass over that allocation would complete at roughly 11:24, right around when the issues began.
