[SERVER-20876] Hang in scenario with sharded ttl collection under WiredTiger Created: 12/Oct/15 Updated: 10/Dec/15 Resolved: 12/Nov/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, WiredTiger |
| Affects Version/s: | 3.0.6, 3.0.7 |
| Fix Version/s: | 3.0.8 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Michael Cahill (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Participants: | |
| Description |
|
Setup:
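(The full setup scripts are attached; below is only a rough sketch of the scenario, a sharded collection with a TTL index under parallel insert load. The names, shard key, expiry, and sizes are assumptions, not taken from the attachments.)

```javascript
// Hypothetical sketch only: database/collection names, shard key, TTL expiry
// and document shape are assumptions, not taken from the attached scripts.
sh.enableSharding("test")
sh.shardCollection("test.ttl", { _id: 1 })   // hot chunk receives the current ObjectIds

var coll = db.getSiblingDB("test").ttl
coll.createIndex({ created: 1 }, { expireAfterSeconds: 600 })   // TTL index

// Insert load, run from several parallel clients (one loop like this per shell):
for (var i = 0; i < 1000000; i++) {
    coll.insert({ created: new Date(), payload: Math.random() })
}
```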
After about 25 minutes of run time (about 15 minutes after ttl deletions started) one of the shards became completely stuck. Stack traces captured after it became stuck are attached:
Here's an overview of the run for the shard that became stuck:
|
| Comments |
| Comment by Githook User [ 13/Nov/15 ] |
|
Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>
Message:
| Comment by Githook User [ 12/Nov/15 ] |
|
Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>
Message:
| Comment by Bruce Lucas (Inactive) [ 20/Oct/15 ] |
|
michael.cahill, agreed that changing how _id lookups are done to avoid the slow search_near fixes the particular scenario that this ticket reports. I opened
| Comment by Kaloian Manassiev [ 19/Oct/15 ] |
|
In response to bruce.lucas's question to renctan above, I just checked the code and the only operations in the sharding receive-chunk path that explicitly create WUOWs are the initial index and collection creation (in both cases with no data), so they should be very quick operations. The call stack above, which includes sharding migration lines, is the receive part, which does one transaction per document.
| Comment by Bruce Lucas (Inactive) [ 19/Oct/15 ] |
|
I believe Michael was referring to the "transaction range of IDs pinned" statistic, for example in this graph from above:
Note how it climbs linearly while the TTL remove calls are executing, for the duration of the execution. If I understand correctly, this is a sign that some transaction is open that requires some transactions to be pinned, presumably the write transactions for the TTL in this case (because that's all that's happening in that graph at that rate). The question is what transaction is open that requires the write transactions to be pinned: Is the TTL index query by any chance done as a single transaction that is held open while we do the deletes? Is that transaction closed when we yield? In some cases I have in fact seen the range of ids pinned reset to 0 during the middle of a TTL pass, giving a "double sawtooth" for that pass (and I think you can just barely see that happening immediately after A in that graph). Would this indicate that the TTL pass yielded at that point? Is the issue maybe that we aren't yielding enough?

(To be clear, whatever issues that might cause, I believe that is a separate issue and is not responsible for the problems reported on this ticket, because a) those problems occur long after the TTL activity has ended and no transactions are pinned, and b) the problems seen here occur whether the documents are deleted by TTL or by ordinary application removes.)
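For reference, the statistic being discussed can be watched from the shell with a polling loop like the one below; the exact stat key string is assumed from 3.0-era serverStatus output and may differ between versions.

```javascript
// Poll WiredTiger's pinned-transaction-range statistic once per second.
// The exact key string is assumed from 3.0-era serverStatus output.
while (true) {
    var txn = db.serverStatus().wiredTiger.transaction;
    print(new Date(), txn["transaction range of IDs currently pinned"]);
    sleep(1000);
}
```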
| Comment by Geert Bosch [ 19/Oct/15 ] |
|
michael.cahill, there should never be many deletes happening in a single transaction. Each WriteUnitOfWork should commit separately and thus cause a new transaction to be started. Separately, we do yielding every so often. Can you elaborate on why you think 3M objects are deleted in a single txn?
| Comment by Michael Cahill (Inactive) [ 19/Oct/15 ] |
|
bruce.lucas, I agree that there is more work to do, but given that sharding does not do lookups in indexes other than _id, the patch build should address the most common cases where this issue has been seen.
| Comment by Bruce Lucas (Inactive) [ 19/Oct/15 ] |
|
Don't we also need to look into the underlying reason the search_near becomes slow after deleting records from the collection, whether with TTL or with ordinary removes? The patch fix indeed avoids that issue for _id lookups, but not for more general index operations. A slight tweak to the previous repro to use a (non-unique) secondary index shows the same problem with lookups done via that index; run using the patch build:
Typical stack during the slower index lookups after the removes shows the same underlying issue with slow search_near, due to tree walks, yet the table is empty at this point:
Repro follows; no TTL, no sharding, nothing pathological - just remove stuff from a collection and then do lookups via a secondary index.
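The actual repro is attached; a rough sketch of its shape (fill the collection, remove everything, then do lookups through a non-unique secondary index; counts, names, and batch size are assumptions) would be:

```javascript
// Hypothetical sketch: fill, remove everything, then look up through a
// non-unique secondary index. Counts and field names are assumptions.
var c = db.getSiblingDB("test").repro;
c.drop();
c.createIndex({ x: 1 });                      // non-unique secondary index

var batch = [];
for (var i = 0; i < 1000000; i++) {
    batch.push({ _id: i, x: i });
    if (batch.length == 1000) { c.insert(batch); batch = []; }
}
c.remove({});                                 // table now holds only deleted records

var t = new Date();
for (var i = 0; i < 100; i++) c.findOne({ x: i });
print("100 secondary-index lookups:", new Date() - t, "ms");
```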
| Comment by Michael Cahill (Inactive) [ 19/Oct/15 ] |
|
Running bruce.lucas's repro, I see an ~85% slowdown between the initial "find" steps and the steps after the insert/remove. With the patch build mentioned above, there is no slowdown. I'll push that change into CR. It is still important to understand which operation in MongoDB is keeping a transaction pinned across millions of TTL remove operations, because that is likely to cause other problems as well, but the change from search_near to search for _id lookups does seem to address the symptom identified here.
| Comment by Bruce Lucas (Inactive) [ 18/Oct/15 ] |
|
Above was with 3.0.7. I tried it on 3.2.0-rc0, but ran into
| Comment by Bruce Lucas (Inactive) [ 18/Oct/15 ] |
|
Simple standalone repro:
Stack samples covering same time period as above, showing that the difference between the first and the second set of findOnes is the tree walk:
Here's the repro script:
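The attached script itself is not reproduced here; a rough sketch of the approach it describes (time a batch of _id findOnes, bulk insert and then remove everything, time the findOnes again; all counts and names are assumptions) would be:

```javascript
// Hypothetical sketch: the slowdown shows up in the second set of findOnes,
// which has to walk over the deleted records. Counts are assumptions.
var c = db.getSiblingDB("test").repro;
c.drop();

function timeFindOnes(label) {
    var t = new Date();
    for (var i = 0; i < 1000; i++) c.findOne({ _id: i });
    print(label, new Date() - t, "ms");
}

timeFindOnes("findOnes before insert/remove:");

var batch = [];
for (var i = 0; i < 3000000; i++) {
    batch.push({ _id: i });
    if (batch.length == 1000) { c.insert(batch); batch = []; }
}
c.remove({});                                  // collection count is 0 again

timeFindOnes("findOnes after insert/remove:");
```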
| Comment by Bruce Lucas (Inactive) [ 16/Oct/15 ] |
|
How much memory should one expect an _id index table that is full of nothing but tombstone markers to use? After I do the findOne() to get rid of the tombstone markers in the collection table, and verify that the slow lookups in the _id index table are still in progress, the WT "bytes currently in the cache" statistic drops from about 1.6 GB to essentially 0 (1.9 MB to be exact). The documents in the collection were tiny. Is it plausible that the tombstone markers in the _id index table would take up so much less space (1.9 MB) than the tombstone markers in the collection table (1.6 GB)?
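One way to compare the two from the shell is to look at the global cache figure alongside the per-table numbers exposed by collStats with indexDetails; the stat key strings below are as they appear in 3.0-era output, and the namespace is an assumption.

```javascript
// Global WiredTiger cache usage.
var cache = db.serverStatus().wiredTiger.cache;
print("total cache bytes:", cache["bytes currently in the cache"]);

// Per-table cache usage for the collection and its _id index
// ("test.ttl" is an assumed namespace).
var s = db.getSiblingDB("test").ttl.stats({ indexDetails: true });
print("collection table cache bytes:",
      s.wiredTiger.cache["bytes currently in the cache"]);
print("_id index table cache bytes:",
      s.indexDetails["_id_"].cache["bytes currently in the cache"]);
```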
| Comment by Bruce Lucas (Inactive) [ 16/Oct/15 ] |
|
Agreed that while changing to use search instead of search_near for _id lookups sounds like a good thing and would ease the symptoms, it may leave other uses of the index vulnerable to the underlying problem, and we should address that as well. Also, while the TTL behavior may not be optimal, I don't think it's the underlying problem either, since the problem (low performance of _id lookups) persists after the tables have been completely emptied and TTL activity has ceased. How can we 1) determine definitively whether the problem is unevictable tombstones in the _id index, 2) whether it is due to a long-running transaction, and if so 3) who is opening the transaction?

(Regarding the experiment I reported on above: I believe the behavior when I did the findOne was a red herring. I repeated the experiment doing the findOne while the chunk move was stuck instead of waiting until after it became unstuck, and it showed the same behavior: the first time was slow, the second time was fast, and the cache bytes fell. I think this is because findOne does not use the _id index table but only the collection table, so the findOne only cleared out the tombstones in the collection table, which had not been touched by the running recvChunk because it was consulting only the index table. Conversely, the findOne had no effect on the running recvChunk transaction, nor did any other query, e.g. a query that traversed the _id index. So the net is that I now think the experiment didn't tell us anything about the cause of the performance issue with _id lookups, other than perhaps to show that whatever is preventing tombstones from being removed from the _id index table is not causing a similar problem with the collection table.)
| Comment by Michael Cahill (Inactive) [ 16/Oct/15 ] |
|
I tried a slightly different approach to making this path do a search in 3.0. There is a patch build here:

I haven't yet had a chance to run bruce.lucas's reproducer against this build – I'm hopeful that it may ease the symptom, but I hope we can also get a better understanding of the underlying problem.
| Comment by Alexander Gorrod [ 16/Oct/15 ] |
|
I took a run at making that change. It's possible to make it work, but there is code that expects a failed call to _locate to leave the cursor positioned. I coded around about 4 places that have expectations about search semantics, but there were more. My feeling is that switching from a search_near to a search for all queries on a unique index would be possible, but it is an invasive change.

It's simple to try a search in the unique code and fall back to a search_near if the key isn't found. The cost there is that if a search is done for a non-existent entry we pay the cost of an additional search. The storage_wiredtiger_index_test program uncovers lots of such cases. Thinking more about this alternative, I don't think it will help in the user's case - they are doing searches for keys that don't exist, otherwise we wouldn't see next/prev calls in the profiler traces.
| Comment by Michael Cahill (Inactive) [ 15/Oct/15 ] |
|
bruce.lucas, one thing I didn't address:
The tombstones should become candidates for eviction as soon as the pinned range goes to zero. They should be evicted either by cache pressure (if the cache is full), or the next time a cursor moves over them, assuming there isn't some other thread blocking the eviction. We do build in some delays to avoid repeatedly retrying eviction if it fails, but I would not expect that to block eviction in this case. In summary, I would also expect to be able to evict tombstones by "B" at the latest.
| Comment by Michael Cahill (Inactive) [ 15/Oct/15 ] |
|
bruce.lucas, the pinned ID statistic is related to running transactions, not data in cache. WiredTiger maintains an "oldest ID", which is updated regularly to track the oldest transaction in the system. The statistic indicates how many update transactions have started since. The reason why I had assumed that TTL was responsible was the shape of the pinned transaction graph – it drops as soon as the TTL pass completes (at "C" in your first graph). I could of course be reading it wrong and the migration could have a transaction active and be blocked behind TTL.
| Comment by Michael Cahill (Inactive) [ 15/Oct/15 ] |
|
geert.bosch, I know you're looking at the 3.2 TTL code in

What I'm seeing from the WiredTiger side in 3.0 is that a transaction is running for the whole of a TTL pass. In bruce.lucas's repro, that means ~3 million remove operations have to be buffered in cache. It looks like the deleteObjects call should yield and split the transaction into pieces (because YIELD_AUTO is set), and there is a separate WriteUnitOfWork around each operation in DeleteStage::work. Do you have any ideas about why a transaction could be running for the whole time?
| Comment by Bruce Lucas (Inactive) [ 15/Oct/15 ] |
|
I wonder if the issue is not a long-running ttl transaction, but rather a long-running transaction related to the long-running recvChunk. Tried the following experiment:
I didn't try doing a findOne during the chunk migration (between C and D) to confirm that it didn't clear out the tombstones, but the stack traces above and the nature of the operations I believe indicate that it would have cleared the tombstones if it could. So my suspicion is that it might be the chunk migration itself that is keeping open a transaction, and it is that very transaction that is preventing the deletion of the tombstones that are slowing it down, so that as long as operations proceed on the donor shard it will never complete. renctan, the long-running operation in question here is the catch-up phase of receiving a chunk. A typical stack follows. Do any of the functions on this stack hold open a WT transaction for the duration?
michael.cahill, one question: if a large number of tombstones can't be deleted because of an open transaction, why doesn't that show up as a large range of pinned ids for the duration of the time that the transaction is open, which would presumably be A-D?
| Comment by Michael Cahill (Inactive) [ 15/Oct/15 ] |
|
In the problematic stack, findSingle for the _id lookup is calling WiredTiger's search_near operation. That operation attempts to find a matching key "close" to the search key. If it lands on a deleted key, it first tries to step next; then, if that doesn't find a visible record, it does a prev operation to step backwards. Unfortunately, the API and implementation mean that if there is a huge table and all the records are deleted, each search_near will traverse the whole table before finally returning WT_NOTFOUND. While not a hang, exactly, that's going to be very slow.

The reason those deleted records can't be evicted (so the tombstones remain in the system, and the query has to keep skipping over them) is that there is an old transaction ID keeping updates pinned in cache. It looks like it must be the TTL monitor that is running in one big transaction. Possible solutions:

Update: I note that MongoDB 3.1+ does search on unique indexes rather than search_near. That may contribute to this issue not reproducing on recent development releases.
| Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ] |
|
Repro script below. This was run on a machine with 12 cores / 24 cpus, 64 GB memory. You may need to adjust cache and/or number of threads for a smaller machine. However note that it seems to be load-dependent: with threads set to 1 it did not reproduce.
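Only as a hedged illustration of the kind of load described (the attached script is what was actually run and may be structured differently), parallel insert clients can be driven from a single shell like this:

```javascript
// Hypothetical sketch: drive N parallel insert clients with startParallelShell.
// Thread count, batch size, and the "test.ttl" namespace are assumptions.
var nThreads = 16;                      // with 1 thread the issue did not reproduce
var shells = [];
for (var t = 0; t < nThreads; t++) {
    shells.push(startParallelShell(
        'var c = db.getSiblingDB("test").ttl;' +
        'var batch = [];' +
        'for (var i = 0; i < 1000000; i++) {' +
        '    batch.push({ created: new Date(), payload: Math.random() });' +
        '    if (batch.length == 1000) { c.insert(batch); batch = []; }' +
        '}'));
}
shells.forEach(function (join) { join(); });   // wait for all insert clients
```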
| Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ] |
|
Version survey:
| Comment by Bruce Lucas (Inactive) [ 13/Oct/15 ] |
|
The repro seems to be reliable, with one of two outcomes:
Detailed analysis of an incident of the first type: A migration to shard 0 got stuck apparently indefinitely - from currentOp
The message means it has completed step 3, which is the bulk clone, and it is now in the catch-up phase where it applies ops that happened while it was cloning. Here are the last messages in the mongod log that relate to the stuck migration:
(The deleting is about removing anything on that shard that may happen to already be there in that chunk range in preparation for moving the chunk there; as expected nothing was deleted.) Here's the range being migrated from shard 1 to shard 0:
Note that the timestamp for the _id marking the beginning of the range is about the same time the chunk move began, so the chunk being migrated is the current "hot" chunk i.e. the chunk receiving all the current ObjectIDs, presumably being migrated proactively from shard 1 to shard 0 so it doesn't have to be migrated later when it has gotten big. Here are some stats relating to that migration:
Taken together the above are a signature of this issue. Here are some selected stack traces showing activity at each call site for that same timeline, showing the same landmarks A, B, C:
I wonder if the issue is that the ttl deletions on this shard have made those _id lookups slow due to an accumulation of deleted items, so the to-shard is not able to process the catch-up ops as quickly as they are being generated on the from-shard, and it never catches up. Note, by the way, that the collection has been entirely emptied by ttl deletions (confirmed, as mentioned above, that the collection count on the stuck shard is 0), so it appears all we are doing at this point is skipping over deleted items; so why are we unable to get rid of the deleted items?
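For anyone reproducing this, a rough way to pull the same pieces of information from a live cluster (the "test.ttl" namespace is an assumption, not taken from the attached diagnostics):

```javascript
// Hypothetical diagnostic sketch; "test.ttl" is an assumed namespace.

// 1. Find in-progress migration operations (run on the shard's mongod).
db.currentOp().inprog.forEach(function (op) {
    if (/moveChunk|migrateThread/.test(op.desc + " " + op.msg)) printjson(op);
});

// 2. Inspect the most recently modified chunks for the collection (run against a mongos).
db.getSiblingDB("config").chunks.find({ ns: "test.ttl" })
    .sort({ lastmod: -1 }).limit(5).forEach(printjson);

// 3. Confirm the collection on the stuck shard is empty despite the slow lookups
//    (run directly against that shard's mongod).
print("documents on this shard:", db.getSiblingDB("test").ttl.count());
```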
| Comment by Bruce Lucas (Inactive) [ 12/Oct/15 ] |
|
Another repro attempt with gdbmon running the whole time did not hit the issue. However, it does have some information about why the insert rate drops so dramatically after ttl deletions start: we are spending a lot of time in splitVector (for reasons tbd), and mongos calls splitVector sequentially with insert, so insert performance drops. You'll note that some of this time is in wt, but there is also substantial time (if not more) being spent in the mongod layer in things like toBson and woCompare. It's not clear why that is the case, nor whether this is really related to the hang issue. Files attached as repro-04-*.

Regarding the apparent interaction between the balancer and ttl deletions: suppose we move an old chunk to a shard at the same time new data is being inserted. The end result, I think, would have the old and new records interspersed (possibly even interleaved) in the wt table, and then when we delete the old records we will be deleting every other record (or something similar). Perhaps this is a case that is problematic for WT?
| Comment by Bruce Lucas (Inactive) [ 12/Oct/15 ] |
|
Here's a run of the same ttl workload standalone (complete stats as standalone-diagnostic.data.tar):
One difference between the sharded and non-sharded situation is that chunks of the collection will have been moved (assuming the balancer is on) from one shard to another; perhaps those moves leave the collection in a state such that a) deletes from the collection decline in performance throughout a ttl pass and also dramatically reduce the performance of the inserts, and b) it becomes prone to the hang. For example, the chunk moves may have disordered the documents in the ttl index and/or collection.