[SERVER-16651] Mass deletions and updates (including TTL) completely starve insertions in replica set under WiredTiger Created: 23/Dec/14 Updated: 16/Jan/15 Resolved: 29/Dec/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Replication, Storage |
| Affects Version/s: | 2.8.0-rc3, 2.8.0-rc4 |
| Fix Version/s: | 2.8.0-rc5 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Andy Schwerin |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Set up a TTL collection with documents expiring after 30 seconds, and then start inserting:

In a single-node replica set the TTL deletion passes almost entirely suppress the insertions; the spikes on the second graph mark the start of each TTL deletion pass. With a standalone node, by contrast, the TTL deletions have almost no noticeable effect on the insertions.

A GDB sampling profile shows the insertion thread blocked and the TTL thread active, but it does not immediately make clear why the insertion thread is blocked the whole time: the TTL thread is doing a variety of things, many of which it would also be doing on a standalone instance, so why should inserts be blocked under a replica set but not standalone?

Under mmapv1 the TTL deletions interfere with the insertions but do not completely suppress them, in both standalone and replica-set configurations. |
| Comments |
| Comment by Pieter Jordaan [ 16/Jan/15 ] |
|
Unrelated to the issue, but how do you generate the profiling graphs shown in the original post? |
| Comment by Githook User [ 29/Dec/14 ] |
|
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>
Message: Before this patch, such sets would transition directly from RECOVERING to
This patch also consolidates the logic for transitioning from SECONDARY to |
| Comment by Githook User [ 29/Dec/14 ] |
|
Author: Andy Schwerin (andy10gen) <schwerin@mongodb.com>
Message: Only nodes in state RECOVERING will attempt a state transition in |
| Comment by Eric Milkie [ 29/Dec/14 ] |
|
It looks like my commit f852fd7474b44de5c33e0a7324c107fc1456da74 contributed to this issue. The original intention of the code was to avoid taking the global lock in MODE_IS every second when the node was in PRIMARY state, but that commit broke that. |
| Comment by Andy Schwerin [ 24/Dec/14 ] |
|
Analysis with VTune suggests that this is caused by tryToGoLiveAsASecondary acquiring the global lock in MODE_S. In essence, because the long-running delete never yields its MODE_IX hold on the global lock, inserts eventually get stuck behind the oplog application thread's tryToGoLiveAsASecondary() call. Under mmapv1, the yielding behavior prevents this from generating long delays.

There are two potential actions we can take. One is to enable yielding for these deletes, as mmapv1 does (see the linked issue).

The other is to eliminate the MODE_S global lock acquisition in tryToGoLiveAsASecondary on primary and secondary nodes. The lock acquisition is unnecessary when the node is in either PRIMARY or SECONDARY state, because we will never transition to secondary in tryToGoLiveAsASecondary from those states. Furthermore, we have a safe, lock-free way to answer the question: consult the value of ReplicationCoordinatorImpl::_canServeNonLocalReads. That value only transitions from false to true synchronously with actions taken by tryToGoLiveAsASecondary, so it will never return a spurious true. A spurious false is not important, because we will simply try again soon, staying in RECOVERING slightly longer than would otherwise be necessary.

A quick prototype indicates that this eliminates the blocking of inserts while long deletes are running. |
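The fix described above can be sketched roughly as follows. This is a minimal illustration, not the server's actual code: `ReplCoord`, `globalLock`, and `canServeNonLocalReads` are stand-ins for the real classes, with `std::shared_mutex` approximating the global lock (shared ≈ MODE_S).

```cpp
#include <atomic>
#include <shared_mutex>

// Illustrative stand-in for the replication coordinator state.
struct ReplCoord {
    std::shared_mutex globalLock;
    std::atomic<bool> canServeNonLocalReads{false};
    bool inRecovering = true;

    // Before the fix: unconditionally acquire the global lock in shared
    // mode, so a long-running delete that never yields its intent lock
    // stalls this call, and every writer queued behind it.
    void tryToGoLiveLocked() {
        std::shared_lock<std::shared_mutex> lk(globalLock);
        goLiveIfPossible();
    }

    // After the fix: consult the lock-free flag first. The flag flips from
    // false to true only on the go-live path itself, so a true can never
    // be spurious; a spurious false just means we retry on the next pass.
    void tryToGoLiveLockFree() {
        if (canServeNonLocalReads.load(std::memory_order_acquire))
            return;  // already live (PRIMARY/SECONDARY): skip the lock
        std::shared_lock<std::shared_mutex> lk(globalLock);
        goLiveIfPossible();
    }

private:
    void goLiveIfPossible() {
        if (!inRecovering)
            return;
        inRecovering = false;
        canServeNonLocalReads.store(true, std::memory_order_release);
    }
};
```

The key property is that once the flag is true, the fast path returns without ever touching the global lock, so the oplog application thread no longer queues behind a delete's MODE_IX hold.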
| Comment by Bruce Lucas (Inactive) [ 23/Dec/14 ] |
|
schwerin, based on the call tree above it doesn't appear to me that the TTL deletions spend much of their time in a part of the call tree that I would have expected to require serialized access to that data structure, unless perhaps they are holding that lock (much) longer than necessary. |
| Comment by Bruce Lucas (Inactive) [ 23/Dec/14 ] |
|
Yes, as suggested, TTL is not key; mass deletes will do the same thing.

As with TTL, mass deletes starve inserts in the case of a replica set:

Without the oplog the inserts are not throttled in the mongod layer, and appear to compete in the WiredTiger layer on more-or-less equal footing with the deletes, judging from the WiredTiger op counters (A marks the point where the deletes begin):

Why does the presence of the oplog together with deletes create a bottleneck, apparently in the mongod layer (based on the mongod-layer lock seen in the stack tree above), that completely starves inserts? If it's a question of additional work, what additional work does the oplog create for deletes that so far exceeds the additional work created for inserts that the deletes essentially starve the inserts to zero? |
| Comment by Andy Schwerin [ 23/Dec/14 ] |
|
Notably, the oplog is a serialization point. There is a small amount of in-memory data that must be accessed serially; IIRC, it's for generating new timestamps that meet the oplog ordering requirements. |
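The serialization point described above can be illustrated with a sketch. This is an assumption-laden simplification, not the server's implementation: every oplog writer must obtain a timestamp strictly greater than the last one issued, so this small piece of in-memory state is shared by inserts, updates, and deletes alike.

```cpp
#include <cstdint>
#include <mutex>

// Illustrative timestamp source. MongoDB-style timestamps are
// (seconds, increment) pairs packed into 64 bits; the mutex here is
// the serialization point every oplog writer must pass through.
class OplogTimestampSource {
public:
    // Each writer calls this before appending its oplog entry.
    uint64_t next(uint32_t wallClockSecs) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (wallClockSecs > _secs) {
            _secs = wallClockSecs;  // clock advanced: restart the increment
            _inc = 1;
        } else {
            ++_inc;  // same (or earlier) second: bump the increment
        }
        return (uint64_t(_secs) << 32) | _inc;
    }

private:
    std::mutex _mutex;
    uint32_t _secs = 0;
    uint32_t _inc = 0;
};
```

Because every replicated write, including each document removed by a mass delete, must pass through this point, a delete-heavy workload contends here with concurrent inserts in a way a standalone node never experiences.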
| Comment by Scott Hernandez (Inactive) [ 23/Dec/14 ] |
|
And the replica-set/standalone difference is just the cost of doing the replication operations: a replica-set primary does more work per document, so you get longer blocking periods and less opportunity for other work to get in. |
| Comment by Scott Hernandez (Inactive) [ 23/Dec/14 ] |
|
Does this really have anything to do with TTL, or is it really about massive deletes starving other write operations? Can you repro this by issuing the same delete manually, without using TTL? If that proves fruitful, replacing the remove/delete with a multi-update of the same subset of data might also cause the same behavior. ... oh, I just realized this is because WiredTiger doesn't yield. |