[SERVER-16651] Mass deletions and updates (including TTL) completely starve insertions in replica set under WiredTiger Created: 23/Dec/14  Updated: 16/Jan/15  Resolved: 29/Dec/14

Status: Closed
Project: Core Server
Component/s: Performance, Replication, Storage
Affects Version/s: 2.8.0-rc3, 2.8.0-rc4
Fix Version/s: 2.8.0-rc5

Type: Bug Priority: Critical - P2
Reporter: Bruce Lucas (Inactive) Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File blocking-updates.png     PNG File profile.png     PNG File wt-rc3-del-rs0.png     PNG File wt-rc3-del-rs1.png     PNG File wt-rc3-rs0.png     PNG File wt-rc3-rs1.png    
Issue Links:
Related
related to SERVER-16667 Isolated writes with extreme latencie... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Set up a ttl collection with documents expiring after 30 seconds, and then start inserting:

    var count = 10000000
    var every = 10000
 
    db.c.drop();
    db.c.ensureIndex({ttl: 1}, {expireAfterSeconds: 30});
    db.c.ensureIndex({i:1})
 
    for (var i=0; i<count; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j=0; j<every; i++, j++)
            bulk.insert({i: Math.random(), ttl: new Date()});
        bulk.execute();
        print(i)
    }
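
Not part of the original repro, but the TTL activity can be watched from a second shell by polling the TTL counters in db.serverStatus() (metrics.ttl.passes and metrics.ttl.deletedDocuments in 2.8-era builds); a rough sketch:

    // run in a second shell against the same node; Ctrl-C to stop.
    // passes and deletedDocuments increase once per TTL pass and per
    // document removed by the TTL monitor, respectively.
    while (true) {
        printjson(db.serverStatus().metrics.ttl);
        sleep(1000);
    }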

In a single-node replica set the TTL deletion passes almost entirely suppress the insertions; the spikes on the second graph mark the start of each TTL deletion pass.

With a standalone node, by contrast, the TTL deletions have almost no noticeable effect on the insertions.
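
For completeness, a minimal sketch of how such a single-node replica set can be stood up for this comparison (the set name rs0 and the flags are assumptions, not taken from the report): start mongod with --replSet rs0 --storageEngine wiredTiger, then in the shell:

    rs.initiate()
    // wait until the node reports PRIMARY (myState 1) before starting the insert loop
    while (rs.status().myState != 1)
        sleep(100)

The standalone comparison is the same mongod invocation without --replSet.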

A gdb sampling profile shows the insertions blocked and the TTL thread active, but it does not immediately make clear why the insertion thread is blocked the whole time. The TTL thread is doing a variety of things, many of which it would also be doing on a standalone instance, so why should inserts be blocked in a replica set but not standalone?

Under mmapv1 the TTL deletions interfere with the insertions but do not completely suppress them, both standalone and in a replica set.



 Comments   
Comment by Pieter Jordaan [ 16/Jan/15 ]

Unrelated to the issue, but how do you generate the profiling graphs as shown in the original post?

Comment by Githook User [ 29/Dec/14 ]

Author:

{u'username': u'andy10gen', u'name': u'Andy Schwerin', u'email': u'schwerin@mongodb.com'}

Message: SERVER-16651 Ensure that single-node replica sets pass through SECONDARY on way to PRIMARY.

Before this patch, such sets would transition directly from RECOVERING to
PRIMARY at start-up, which is not a valid arc in the state transition diagram
for replica set member states.

This patch also consolidates the logic for transitioning from SECONDARY to
PRIMARY in one place and resolves some minor concurrency and consistency errors
therein.
Branch: master
https://github.com/mongodb/mongo/commit/cf2fc69b4ca25d8dc0b282db6556b8d50ffefc14

Comment by Githook User [ 29/Dec/14 ]

Author:

{u'username': u'andy10gen', u'name': u'Andy Schwerin', u'email': u'schwerin@mongodb.com'}

Message: SERVER-16651 Avoid global lock acquisition in tryToGoLiveAsASecondary() if a primary or secondary.

Only nodes in state RECOVERING will attempt a state transition in
tryToGoLiveAsASecondary(), and we have a lock-free way to check if a node is in
one of those two states. The danger of asking the question, "is this node in
either primary or secondary state?" without holding the global lock is that the
answer might change at any time. However, only tryToGoLiveAsASecondary() causes
it to transition from "no" to "yes", and if it transitions from "yes" to "no",
we know that tryToGoLiveAsASecondary() will be called again promptly.
Branch: master
https://github.com/mongodb/mongo/commit/acc0baca79b412794002ccdb6d37f9b2acf4f3c9

Comment by Eric Milkie [ 29/Dec/14 ]

It looks like my commit at f852fd7474b44de5c33e0a7324c107fc1456da74 contributed to this issue. The original intention of the code was to avoid taking the global lock in MODE_IS every second when the node state was PRIMARY, but that commit broke that.
The problem is that the applier needs to run while we are in "drain mode", which is a special state that looks like PRIMARY but writes are not yet accepted and replicated ops are still applied.
I believe Andy's proposed fix will solve this issue.

Comment by Andy Schwerin [ 24/Dec/14 ]

Analysis with VTune suggests that this is caused by tryToGoLiveAsASecondary() taking the global lock in MODE_S. In essence, because the long-running delete never yields its MODE_IX hold on the global lock, inserts eventually get stuck behind the oplog application thread's tryToGoLiveAsASecondary() call. In mmapv1, the yielding behavior prevents this from generating long delays.

There are two potential actions we can take. One is to enable yielding for these long deletes under WiredTiger, as per SERVER-16640. This is more of a band-aid, because it still reduces the fraction of the time that writes may run concurrently.

The other is to eliminate the MODE_S global lock acquisition in tryToGoLiveAsASecondary() on primary and secondary nodes. The lock acquisition is unnecessary when the node is in either PRIMARY or SECONDARY state, because we will never transition to secondary in tryToGoLiveAsASecondary() from those states. Furthermore, we have a safe, lock-free way to answer the question, which is to consult the value of ReplicationCoordinatorImpl::_canServeNonLocalReads. That value only transitions from false to true synchronously with actions taken by tryToGoLiveAsASecondary(), so it will never return a spurious true. A spurious false is not important, because we will simply try again soon, staying in RECOVERING slightly longer than would otherwise be necessary.

A quick prototype indicates that this eliminates the blocking of inserts while long deletes are running.
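
Not part of the original analysis, but the blocking can also be observed from the shell while a mass delete is running: blocked insert operations show up in db.currentOp() with waitingForLock set. An illustrative sketch:

// list operations currently blocked waiting for a lock (run against the primary)
db.currentOp().inprog.forEach(function(op) {
    if (op.waitingForLock)
        printjson({op: op.op, ns: op.ns, secs_running: op.secs_running});
});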

Comment by Bruce Lucas (Inactive) [ 23/Dec/14 ]

schwerin, based on the call tree above it doesn't appear to me that the TTL deletions spend much of their time in a part of the call tree that I would expect to require serialized access to that data structure, unless perhaps they are holding that lock (much) longer than necessary.

Comment by Bruce Lucas (Inactive) [ 23/Dec/14 ]

Yes, as suggested, TTL is not the key factor; mass deletes will do the same thing.

// one thread doing this
function ins() {
 
    var count = 1000000
    var every = 10000
 
    db.c.drop();
    db.c.ensureIndex({not_ttl: 1})
    db.c.ensureIndex({i:1})
 
    for (var i=0; i<count; ) {
        var bulk = db.c.initializeUnorderedBulkOp();
        for (var j=0; j<every; i++, j++)
            bulk.insert({i: Math.random(), not_ttl: new Date()});
        bulk.execute();
        print(i)
    }
}
 
// other thread doing this
function del() {
    while (true) {
        sleep(10*1000)
        print('removing')
        var r = db.c.remove({})
        printjson(r)
    }
}

As with TTL, mass deletes starve inserts in the case of a replica set:

Without the oplog the inserts are not throttled in the mongod layer, and appear to compete in the wt layer on more-or-less equal footing with the deletes, judging from the wt op counters (A marks the point where the deletes begin):

Why does the presence of the oplog, combined with deletes, create a bottleneck that completely starves inserts, apparently in the mongod layer (based on the mongod-layer lock seen in the stack tree above)? If it's a question of additional work, what additional work does the oplog create for deletes that so far exceeds the additional work it creates for inserts that the deletes essentially starve the inserts to zero?
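
(Not from the original comment.) One concrete piece of the additional work is that a multi-document remove is replicated as a separate oplog entry per deleted document. Those per-document delete entries can be counted from the shell; the namespace below assumes the repro ran against collection c in the default test database:

// count individual delete entries written to the oplog for the test collection
db.getSiblingDB("local").oplog.rs.find({op: "d", ns: "test.c"}).count()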

Comment by Andy Schwerin [ 23/Dec/14 ]

Notably, the oplog is a serialization point. There is a small amount of in-memory data that must be accessed serially. IIRC, it's for generating new timestamps that meet the oplog ordering requirements.
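
Not in the original comment, but the ordering requirement is visible in the ts field of the oplog: every replicated write must be assigned a new, increasing Timestamp inside that serialized section. An illustrative query:

// newest few oplog entries with their per-operation timestamps
db.getSiblingDB("local").oplog.rs.find({}, {ts: 1, op: 1, ns: 1}).sort({$natural: -1}).limit(5).forEach(printjson)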

Comment by Scott Hernandez (Inactive) [ 23/Dec/14 ]

And the replica set vs. single-node behavior is just the cost of doing the replication operations. A replica set primary does more work per document, so you get longer blocking periods and less opportunity for other work to get in....

Comment by Scott Hernandez (Inactive) [ 23/Dec/14 ]

Does this really have anything to do with TTL, or is it really about massive deletes starving other write operations? Can you repro this by issuing the same delete manually, without using TTL?

Also, if that proves fruitful, replacing the remove/delete with a multi-update of the same subset of data might cause the same behavior (see the sketch below).

.... oh, just realized this is because WiredTiger doesn't yield...
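
Not part of the original comment: a minimal shell sketch of the multi-update variant suggested above, written as a drop-in replacement for the del() thread in the repro shown earlier in this ticket (the touched field name is arbitrary):

// other thread: periodically rewrite every document instead of removing it
function upd() {
    while (true) {
        sleep(10*1000)
        print('updating')
        var r = db.c.update({}, {$set: {touched: new Date()}}, {multi: true})
        printjson(r)
    }
}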
