[SERVER-44092] Orphan documents impact IXSCAN performance Created: 18/Oct/19 Updated: 06/Dec/19 Resolved: 06/Dec/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.0.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Peter Ivanov | Assignee: | Eric Sedor |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
This happens once in a while. So far we have found no clear steps to reproduce it, but the symptoms are distinctive on their own: some mongod instances start to show abnormally high CPU usage, even though they hold the same amount of data as the other shards and serve the same count and shape of operations. CPU usage may climb to nearly 100%, but the mongod instance keeps working, albeit with increased latency.

In all cases we found the same offending query pattern: frequent reads from a collection with, say, a "foo == 1" query. The returned documents are updated with "foo = 2" and written back to the database, then the read query is repeated. Normally this query often returns zero documents, and that does not trouble mongod at all. When the issue occurs, however, the query still returns zero documents, but takes much longer to do so.

Prior experience with orphaned documents gave us a degree of insight, so we cleaned up the orphans in the problematic collection. All performance effects ceased at once, with no other side effects observed. The issue has presented itself several times, and each time this workaround was just as successful. On one such occasion the degraded state lasted some 12 hours, so this is not something that resolves itself - and therefore it does not look like a healthy last stage of chunk balancing, for example. Orphans are left behind for good. Or rather, for the bad.

The issue seems random and damaging enough to warrant some sort of user-side workaround, like cron-based orphan cleanup on all collections. Clearly, such measures must not be the only way to fix the issue.
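For concreteness, here is a minimal mongo-shell sketch of the read-update loop described above. The collection name "tasks", the field name "foo", and the values are illustrative placeholders, not taken from the reporter's deployment:

```javascript
// Hypothetical illustration of the workload described above (mongo shell).
// Collection "tasks" and field "foo" are placeholders.

// The queried field has an ordinary secondary index:
db.tasks.createIndex({ foo: 1 });

// Read side: fetch documents that currently match foo == 1 ...
var batch = db.tasks.find({ foo: 1 }).toArray();

// ... then mark each one as processed and write it back:
batch.forEach(function (doc) {
  db.tasks.updateOne({ _id: doc._id }, { $set: { foo: 2 } });
});

// The read is then repeated. Normally it soon returns zero documents cheaply;
// with orphans still matching foo == 1 on some shard, it still returns zero
// documents via mongos but takes far longer, matching the symptom above.
```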
| Comments |
| Comment by Eric Sedor [ 06/Dec/19 ] |
I'm going to close this ticket for now, but I do encourage you to write back if explain plan information becomes available. Thanks a lot!
| Comment by Peter Ivanov [ 24/Oct/19 ] |
Thank you for the clarification, Eric. We'll come back with more info if we can.
| Comment by Eric Sedor [ 23/Oct/19 ] |
To clarify, there is extra work involved in avoiding returning orphaned documents in query results (there are some details in a recent ticket). One potential workaround (before 4.4), aside from the automation of cleanupOrphaned you mentioned, is to look into what has prompted the orphans in the deployment to begin with. For assistance troubleshooting the creation of orphans, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.

Gratefully,
Eric
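For reference, the documented pre-4.4 cleanup loop for the cleanupOrphaned command looks roughly like the sketch below. It must be run directly against each shard's primary (not through mongos); the namespace "mydb.tasks" is a placeholder:

```javascript
// Sketch of the kind of cleanupOrphaned automation discussed above.
// Connect to a shard's primary, not to mongos (pre-4.4 semantics).
var nextKey = {};
var result;
while (nextKey) {
  result = db.adminCommand({
    cleanupOrphaned: "mydb.tasks",   // placeholder namespace
    startingFromKey: nextKey
  });
  if (result.ok !== 1) {
    throw new Error("cleanupOrphaned failed: " + tojson(result));
  }
  // stoppedAtKey is absent once the last range has been cleaned up
  nextKey = result.stoppedAtKey;
}
```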
| Comment by Peter Ivanov [ 23/Oct/19 ] |
Hi Eric. Possibly, we could. This doesn't happen all that often, though, and it won't be possible at all if we implement a workaround in the manner described above. But the query really is exactly as simple as mentioned above: all you need is an index on a field and a query with a specific value or a range for that field. I believe that if you artificially create a lot of orphans in a state where this field matches the query, and then modify the actual moved documents so that they no longer satisfy it, then a request via mongos will reproduce exactly the issue I'm describing.

From the user's perspective, though, a mechanism to properly ensure the cleanup of all orphans would solve this issue completely. It depends, of course, on exactly how 'eventually' that will happen.
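A hedged sketch of the reproduction idea Peter describes above, compressing his two steps into one (rather than moving documents and mutating them, it only fabricates orphans so that no owned document matches). All names, hosts, and the { sk: 1 } shard key are hypothetical:

```javascript
// Assume a two-shard cluster; the collection is sharded on { sk: 1 } and
// split at sk: 0 so that shardA owns sk < 0 and shardB owns sk >= 0,
// with an ordinary secondary index on the queried field "foo".
sh.shardCollection("mydb.tasks", { sk: 1 });
sh.splitAt("mydb.tasks", { sk: 0 });
db.tasks.createIndex({ foo: 1 });

// Fabricate orphans: connect directly to shardA's primary (bypassing
// mongos) and insert documents from the range shardA does NOT own, all
// matching the hot query. No owned document matches foo == 1 at all:
var shardA = new Mongo("shardA-host:27018").getDB("mydb"); // hypothetical host
for (var i = 1; i <= 100000; i++) {
  shardA.tasks.insertOne({ sk: i, foo: 1 });
}

// If the hypothesis holds, this query via mongos returns zero documents
// but takes noticeably longer for as long as the orphans remain:
db.tasks.find({ foo: 1 }).toArray();
```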
| Comment by Eric Sedor [ 23/Oct/19 ] |
This is Eric, chiming in to ask for some more information. As my colleague dmitry.agranat already pointed out, we are working to ensure sharded clusters clean up orphans on their own. However, I think it would help if we understood exactly where the poor performance is coming from in this case. For that, we need to know exactly which query and index are being impacted by the existence of orphan documents, and in what way. If/when this happens again, can you please collect explain("executionStats") output for the query both before and after cleaning up orphaned documents?

Gratefully,
Eric
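For anyone hitting the same symptom, collecting the requested diagnostics might look like the following; the collection name and predicate are placeholders:

```javascript
// Run via mongos against the affected collection, before the cleanup:
var before = db.tasks.find({ foo: 1 }).explain("executionStats");

// ... run cleanupOrphaned on the affected shard(s), then repeat the query:
var after = db.tasks.find({ foo: 1 }).explain("executionStats");

// Fields worth comparing between the two runs:
//   executionStats.totalKeysExamined   -- index keys scanned (IXSCAN work)
//   executionStats.totalDocsExamined   -- documents fetched and filtered
//   executionStats.executionTimeMillis -- end-to-end query latency
printjson({ before: before.executionStats, after: after.executionStats });
```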
| Comment by Dmitry Agranat [ 22/Oct/19 ] |
Hi Peter, We tentatively plan to address this issue starting with MongoDB 4.4: the server will ensure that eventually there are no more orphaned documents.

Thanks,
Dmitry
| Comment by Eric Sedor [ 21/Oct/19 ] |
Thanks petr.ivanov.s@gmail.com; we will take a look at this submission and may have some follow-up questions.

Eric