[SERVER-44092] Orphan documents impact IXSCAN performance Created: 18/Oct/19  Updated: 06/Dec/19  Resolved: 06/Dec/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.0.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Peter Ivanov Assignee: Eric Sedor
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-38972 mongos takes forever for multi shard ... Closed
Operating System: ALL
Participants:

 Description   

This happens once in a while. So far we have found no clear steps to reproduce it, but the symptoms are clear enough on their own.

The issue looks like this: some mongod instances start to show abnormally high CPU usage, even though they hold the same amount of data as the other shards and serve the same count and shape of operations. CPU usage may climb to nearly 100%, but the mongod instance keeps working, albeit with increased latency.

In all cases we found the same offending query pattern: frequent reads from a collection with, say, a "foo == 1" query. The returned documents are updated with "foo = 2" and written back to the database, then the read query is repeated. Normally this query often returns zero documents, and that doesn't upset mongod very much. When the issue occurs, however, the query still returns zero documents, but takes much longer to do so.
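
A minimal sketch of the pattern, assuming a hypothetical "tasks" collection with an index on "foo" (names are illustrative, not our real schema); the workload runs through mongos:

    // Frequent read of "pending" documents; normally this returns zero documents quickly.
    var pending = db.tasks.find({ foo: 1 }).toArray();
    // Matched documents are processed and written back with foo = 2.
    db.tasks.update({ foo: 1 }, { $set: { foo: 2 } }, { multi: true });
    // The { foo: 1 } read is then repeated. During the incident it still returns zero
    // documents, but takes much longer and drives CPU on the affected shard.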

Some prior experience with orphaned documents gave us a degree of insight, so we cleaned up the orphans in the problematic collection. All performance effects ceased at once, and no other side effects were observed.

The effect has presented itself several times, and each time this workaround was just as successful. On one such occasion the degraded state lasted some 12 hours, so this is not something that resolves itself - and therefore it doesn't look like a healthy final stage of chunk balancing, for example. Orphans are left behind for good. Or rather, for the bad.

The issue seems random and damaging enough to warrant some sort of user-side workaround, such as a cron-based orphan cleanup across all collections. Clearly, such measures should not be the only way to fix the issue.
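
For reference, the workaround we have in mind is essentially the documented cleanupOrphaned loop, run directly against each shard's primary for every sharded collection (the "mydb.tasks" namespace below is hypothetical):

    // Run while connected to the shard's primary mongod, not through mongos.
    var nextKey = {};
    var result;
    while (nextKey != null) {
        result = db.adminCommand({ cleanupOrphaned: "mydb.tasks", startingFromKey: nextKey });
        if (result.ok != 1) {
            print("cleanupOrphaned failed or timed out, stopping");
            break;
        }
        printjson(result);
        nextKey = result.stoppedAtKey;
    }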



 Comments   
Comment by Eric Sedor [ 06/Dec/19 ]

petr.ivanov.s@gmail.com,

I'm going to close this ticket for now but I do encourage you to write back if explain plan information becomes available.

Thanks a lot!
Eric

Comment by Peter Ivanov [ 24/Oct/19 ]

Thank you for the clarification, Eric. We'll come back with more info if we can.

Comment by Eric Sedor [ 23/Oct/19 ]

To clarify, there is extra work to do to avoid returning orphaned documents in query results (there are some details in a recent ticket, SERVER-38972). This has contributed to the desire to eliminate orphans entirely. In the meantime, explain plans might provide us with evidence of specific bugs that we're not aware of. I'm going to keep this ticket open for a while in case you are able to provide them.

One potential workaround (before 4.4), aside from the automation of cleanupOrphaned you mentioned, is to look into what has prompted the creation of orphans in the deployment to begin with. For assistance troubleshooting the creation of orphans, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.

Gratefully,
Eric

Comment by Peter Ivanov [ 23/Oct/19 ]

Hi Eric. 

Possibly, we could. This doesn't happen all that often, though, and it won't be possible at all if we implement a workaround of the kind described above.

But the query really is exactly as simple as described above. All you need is an index on a field and a query with a specific value or range for that field. I believe that if you artificially create a lot of orphans whose field value matches the query, and then modify the actual (properly moved) documents so that they no longer satisfy it, a request via mongos will reproduce exactly the issue I'm describing.
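
A rough sketch of that reproduction idea, with hypothetical host, namespace, and field names; the key assumption is that the inserted shard-key values fall into chunks owned by a different shard, so the inserted documents are orphans on the shard that receives them:

    // Connect directly to one shard's primary, bypassing mongos.
    var shard = new Mongo("shard0-primary.example.net:27018").getDB("mydb");
    for (var i = 0; i < 100000; i++) {
        // Shard-key values chosen so that the owning chunks live on another shard.
        shard.tasks.insert({ _id: i, foo: 1 });
    }
    // Then, via mongos, update the legitimately owned documents so they no longer match:
    //   db.tasks.update({ foo: 1 }, { $set: { foo: 2 } }, { multi: true });
    // and observe db.tasks.find({ foo: 1 }) through mongos slowing down on that shard.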

 

From the user's perspective, though, a mechanism to reliably ensure the cleanup of all orphans would solve this issue completely. It depends, of course, on exactly how 'eventually' that would happen.

Comment by Eric Sedor [ 23/Oct/19 ]

Hi petr.ivanov.s@gmail.com,

This is Eric chiming in to ask for some more information. As my colleague dmitry.agranat already pointed out, we are working to ensure sharded clusters clean up orphans on their own.

However, I think it would help if we understood exactly where the poor performance is coming from in this case. For that we need to understand exactly what query and index are being impacted by the existence of orphan documents, and in what way.

If/when this happens again, can you please collect explain("executionStats") output for the query both before and after cleaning up orphaned documents?
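
For example (collection and field names below are hypothetical, matching the pattern in the description; run through mongos):

    // Before cleanup:
    db.tasks.find({ foo: 1 }).explain("executionStats")
    // After running cleanupOrphaned on the affected shard(s), repeat the same explain.
    // Comparing totalKeysExamined, totalDocsExamined, and executionTimeMillis between
    // the two runs should show where the extra work is going.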

Gratefully,
Eric

Comment by Dmitry Agranat [ 22/Oct/19 ]

Hi Peter,

We tentatively plan to address this issue starting with MongoDB 4.4: the server will ensure that, eventually, there are no more orphaned documents.

Thanks,
Dima

Comment by Eric Sedor [ 21/Oct/19 ]

Thanks petr.ivanov.s@gmail.com; we will take a look at this submission and may have some follow-up questions.

Eric
