[SERVER-38972] mongos takes forever for multi shard queries Created: 13/Jan/19 Updated: 27/Oct/23 Resolved: 04/Feb/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.0.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Hartig | Assignee: | Danny Hatcher (Inactive) |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | |
| Participants: | |
| Description |
If mongos needs to hit multiple shards, the query stalls, but if I run the query on each shard independently it finishes instantly. The collection is sharded on { r, u }. Additionally, if I run a query on the mongos that only needs to target a single shard, it also finishes instantly. "conn34" is the affected connection in the logs. It looks like one shard returns very quickly and the other server is the slow one. I attached the operation from the slow server, captured by running db.currentOp(true).
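A minimal sketch of how the attached operation can be isolated on the slow shard's primary; the collection name "sessions4" is taken from later comments and the database name "mydb" is a placeholder:

```javascript
// Filter db.currentOp() output down to long-running operations on the affected
// collection; "mydb" is a placeholder database name.
db.currentOp({
    "ns": "mydb.sessions4",
    "secs_running": { "$gt": 5 }     // only ops running longer than 5 seconds
}).inprog.forEach(function (op) {
    print(op.opid + "  " + op.secs_running + "s  " + op.planSummary);
});
```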
| Comments |
| Comment by Danny Hatcher (Inactive) [ 04/Feb/19 ] |
Hello admiraljames, After an internal discussion, our conclusion is that while the behavior of orphans causing long queries is undesired, it is technically working as designed. To solve the situation long-term, we are looking into ways to improve the range deleter, but nothing is currently scheduled. In the short term, the cleanupOrphaned command is the proper way to remove orphans so that queries run in a reasonable timeframe. Based on the above, I am going to close this ticket as "Working as Designed". Thank you for bearing with us during this discovery process! Thank you, Danny
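For anyone landing here with the same symptom, a minimal sketch of the cleanupOrphaned loop (following the standard documentation pattern), run via adminCommand on the affected shard's primary rather than through mongos; the namespace "mydb.sessions4" is a placeholder:

```javascript
// Iterate cleanupOrphaned over the whole collection; each call deletes one
// contiguous range of orphaned documents and reports where it stopped.
var nextKey = {};
var result;
while (nextKey != null) {
    result = db.adminCommand({ cleanupOrphaned: "mydb.sessions4", startingFromKey: nextKey });
    if (result.ok != 1) {
        print("cleanupOrphaned failed or timed out; stopping.");
        break;
    }
    printjson(result);
    nextKey = result.stoppedAtKey;   // null once the whole key space has been scanned
}
```

The loop must be repeated per shard that holds orphans, and it only removes documents whose chunks no longer belong to that shard.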
| Comment by Danny Hatcher (Inactive) [ 22/Jan/19 ] |
Hello James,
When a normal query is run, it goes through our Query Planner process. The first time the query occurs, the Planner compares various indexes and determines the one that should best serve that query shape. Once an index is chosen, it is cached for future invocations of that query shape. This helps reduce the overhead of going through the Query Planner for repetitions of the same query. The cache is flushed on mongod restarts or index drops, as changes in the system may push a different index ahead in terms of performance. The cache is also regenerated when a cached plan has been used a large number of times, so that the system is never stuck using a bad index for too long. This seems to be what happened on "phil".
> Could it be that the explain(true) doesn't use the plan cache but when I run the query (without explain) then it does use the cache and that would explain the discrepancy I encountered earlier?
Yes, this is exactly what happened. Because we want the explain to show all the different plans that the Query Planner attempts to run for the specific query, we intentionally bypass whatever plans are cached. If the explain shows one plan and the actual query uses another (worse) one, the best step is to manually flush the plan cache.
> I think we can summarize that the whole issue was caused by the orphaned documents on phil.
I agree, but I'd like to run through it internally once more before confirming.
> It still seems like it was choosing the wrong index which caused the query to take longer than necessary which might be a separate bug?
It is possible for the Query Planner to cache a suboptimal index if the sample it takes is not indicative of the true performance of the query. However, a flush of the plan cache should ideally cause the Planner to choose the optimal plan the next time the query is run. If you encounter a reproducible instance where the Planner caches a suboptimal plan, please open a new ticket. Thank you, Danny
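A minimal sketch of the manual flush described above, assuming the collection name used elsewhere in this ticket; the plan cache is per mongod, so this would be run directly on "phil":

```javascript
// Inspect which query shapes have cached plans, then drop them so the next
// execution goes through full plan selection again.
var cache = db.sessions4.getPlanCache();
printjson(cache.listQueryShapes());   // cached shapes for this collection
cache.clear();                        // flush all cached plans for the collection
```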
| Comment by James Hartig [ 22/Jan/19 ] |
Hey Daniel, Thanks for following up.
> Per your latest comment, am I correct in saying that queries through the mongos now return quickly by running against a combination of "queen" and "zeta"? And that the only time the query is slow is when it is run directly on "phil" (which is now a Secondary)?
Yes, that was the state it was in when I last commented.
> If the above is true, can you run the query directly against "phil" but specify a hint of the proper index?
If I hinted the proper index then it would work fine.
> If the query runs quickly with the hint but slowly without, could you clear the query plan for the shape and try to run the test query again?
This is very interesting because I didn't realize there was a cache. The problem is no longer present on "phil", and I can only imagine that is because the cache was at some point automatically cleared or expired. Before, I mentioned:
> It seems like when I do the explain it uses the "r_1_u_1_pid_1_oid_1_incr_1" on phil, as it should, but when I actually run the query, it prints out: "planSummary: IXSCAN { r: 1, e: 1, ss: 1, tsc: 1, tslp: 1 } keysExamined:15661823 docsExamined:15661823".
Could it be that the explain(true) doesn't use the plan cache but when I run the query (without explain) then it does use the cache and that would explain the discrepancy I encountered earlier? I think we can summarize that the whole issue was caused by the orphaned documents on phil. It still seems like it was choosing the wrong index, which caused the query to take longer than necessary, which might be a separate bug? Let me know how you want to proceed. I'm also unsure how I should prevent what happened in the future when I need to add new shards and a bunch of orphaned documents exist on the old shard(s).
| Comment by Danny Hatcher (Inactive) [ 22/Jan/19 ] |
Hello James, Thank you for keeping us updated with your cluster's situation; it is immensely helpful. Per your latest comment, am I correct in saying that queries through the mongos now return quickly by running against a combination of "queen" and "zeta"? And that the only time the query is slow is when it is run directly on "phil" (which is now a Secondary)? If the above is true, can you run the query directly against "phil" but specify a hint of the proper index? The hint will override the query optimizer and use the index specified. If the query runs quickly with the hint but slowly without, could you clear the query plan for the shape and try to run the test query again? Thank you, Danny
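A sketch of the suggested test, using the query and index name quoted elsewhere in this ticket, run directly against "phil":

```javascript
// Force the named index; if this is fast while the unhinted query is slow, the
// cached plan (or plan selection) is the culprit rather than the index itself.
db.sessions4.find({ r: "gce-us-east1", u: { $lt: "V" } })
            .hint("r_1_u_1_pid_1_oid_1_incr_1")
            .limit(2)
            .explain("executionStats");
```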
| Comment by James Hartig [ 18/Jan/19 ] |
Hey Danny, Since the secondary in the same replica set as phil doesn't have any issues running that query, I promoted it to be the primary for the time being while we are debugging. I thought the issue might be that I need to compact phil after the orphaned documents were deleted, but looking at the indexes on phil and zeta (the new primary that doesn't have the issue), I don't see much of a difference:
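The index listings themselves are not shown here; a minimal way to produce the comparison (run on both "phil" and "zeta" and diff the output) would be something like:

```javascript
// Print index names and key patterns for the collection on this node.
db.sessions4.getIndexes().forEach(function (ix) {
    print(ix.name + "  " + tojson(ix.key));
});
```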
Even after promoting zeta and stepping down phil, the issue still persists on phil and that query takes multiple minutes to return.
| Comment by James Hartig [ 18/Jan/19 ] |
Hey Danny, The orphaned documents have finished being cleaned up but it still seems like there's an issue: the explain() now returns instantly, but if I actually run the "limit" call it takes multiple minutes, whether run via mongos or locally on phil. Query:
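Presumably the same find used throughout the thread, run with a limit (the exact limit value isn't stated here):

```javascript
db.sessions4.find({ r: "gce-us-east1", u: { $lt: "V" } }).limit(2)
```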
The above seems to return instantly via queen locally and via mongos, but phil still has an issue with it. I've attached the explain(true) from both servers when running the query locally. I'm also reuploading the diagnostic.data from phil as well as the log file from when I was running the queries (including 100% slow query reporting). I'll append "-post" to the file names to differentiate them from before the orphaned cleanup finished. It seems like when I do the explain it uses the "r_1_u_1_pid_1_oid_1_incr_1" index on phil, as it should, but when I actually run the query, it prints out: "planSummary: IXSCAN { r: 1, e: 1, ss: 1, tsc: 1, tslp: 1 } keysExamined:15661823 docsExamined:15661823". I don't know why it would've picked that index at all, especially since the explain didn't show it using that index, and it doesn't even contain the "u" field.
| Comment by James Hartig [ 16/Jan/19 ] |
Hey Danny, There's still a few days left so I'll update once it's done. I can see why having to scan and discard a lot of orphaned docs would make the query slow, but ideally it could use the shard ownership info along with my query to only target the docs the shard actually owns. So theoretically it should be able to transform the query by tightening the bound on "u" to the range that "phil" owns, as in the sketch below:
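Reconstructed illustration only: the split point value is quoted later in this thread, and which side of it "phil" owns is an assumption.

```javascript
// Original query, as issued through mongos:
db.sessions4.find({ r: "gce-us-east1", u: { $lt: "V" } }).limit(2)

// Tighten the bound on u to the range this shard actually owns, using the split
// point quoted later in the ticket; whether "phil" owns the range below or above
// that split point is an assumption here.
db.sessions4.find({ r: "gce-us-east1", u: { $lt: "UxOVavlFZXtKRL5MnB+1uQ==" } }).limit(2)
```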
The transformed query returns instantly locally AND through mongos, and uses the "r_1_u_1_pid_1_oid_1_incr_1" index.
> Were you using 3.6 at any point for this cluster or has it always been on 4.0?
No, this cluster is only a month old, so it started with 4.0.4 and was recently updated to 4.0.5.
> Additionally, do you know how many chunks are in your cluster?
Specifically, queen should have 2078 chunks and phil should have 2075.
I guess the bug title should be altered at this point, since the problem isn't with mongos; it seems like it's just a problem with the query needing to skip a bunch of orphaned documents. This means whenever we add a new shard replica set to a range and the balancer starts to migrate half of the chunks to it, the old replica set will be severely degraded for multiple days (or weeks) until the range deleter has a chance to delete all of the orphaned documents. Hopefully you agree with me and consider that a bug.
| Comment by Danny Hatcher (Inactive) [ 16/Jan/19 ] |
Hello James, Were you using 3.6 at any point for this cluster or has it always been on 4.0? If it was previously on 3.6, do you know if you ever experienced issues like this? Additionally, do you know how many chunks are in your cluster? You can run the following command against a mongos to get a simple count of them:
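A likely equivalent of the suggested command, assuming the standard config.chunks collection; the namespace in the per-shard breakdown is a placeholder:

```javascript
// Total chunk count across the cluster, plus a per-shard breakdown for the
// sharded collection ("mydb.sessions4" is a placeholder namespace).
db.getSiblingDB("config").chunks.count();
db.getSiblingDB("config").chunks.aggregate([
    { $match: { ns: "mydb.sessions4" } },
    { $group: { _id: "$shard", chunks: { $sum: 1 } } }
]);
```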
Thank you, Danny
| Comment by Danny Hatcher (Inactive) [ 16/Jan/19 ] |
Hello James, My colleagues have pointed me to a very similar problem we had on one of our internal clusters. The symptoms were the same, and the cause in that case was a large number of orphaned documents on one shard. As you're already clearing out those documents, let's see what happens when that's finished. Thank you, Danny
| Comment by James Hartig [ 16/Jan/19 ] |
> From the most recent two explains you posted, both the mongos run and the local mongod run are using the same index:
Whoops, I might've been confused. When running locally on "phil" and "queen" they still use the wrong index but it returns instantly. I thought before they were using the right one but I can't seem to reproduce that anymore. Either index seems to result in the same totalKeysExamined (100 if limit 100, 10 if limit 10, etc.), so I guess it doesn't matter. I'm not sure why they're not using "r_1_u_1_pid_1_oid_1_incr_1", since that seems like the more optimal index, but I guess the index isn't the issue with the slowness then. It must be the SHARDING_FILTER.
> Over the course of the test query Queen is averaging 8,000 document scans per second while Phil is averaging 263,000:
This is because of the orphaned docs. Phil is consistently looking at more documents than it needs to in order to filter out the wrong docs. This is actually caused by a separate query.
> Could you move the sampling rate back up to 100% until we find the root cause?
I can increase it back up to 100% and I can temporarily disable the queries that are causing the extra load. It should make the mongos query return faster but still multiple orders of magnitude slower than running it locally. I'll try to do that tomorrow and get more logs and diagnostic data.
> It is possible that the combination of the range deleter and SHARDING_FILTER stage is the underlying issue so I will follow up internally on that.
Keep me posted. Thanks!
| Comment by Danny Hatcher (Inactive) [ 16/Jan/19 ] |
Hello James, From the most recent two explains you posted, both the mongos run and the local mongod run are using the same index: "indexName" : "r_1_ss_1_tsc_1_tslp_1_u_1". Where do you see the "correct" index being used locally? I can see that Queen is also struggling under load, but less so than Phil. I believe your assessment is accurate; Phil has more documents than Queen, which means more resources are required to service the same queries, which means the system as a whole gets slower. Over the course of the test query, Queen is averaging 8,000 document scans per second while Phil is averaging 263,000. However, while this would explain why queries take longer against Phil than they do against Queen, it still does not explain why a query through the mongos would take dramatically longer than running against both of the shards individually. It is possible that the combination of the range deleter and the SHARDING_FILTER stage is the underlying issue, so I will follow up internally on that. I understand that you most likely set the sampling rate to 20% in order to have smaller logs, but I believe that to accurately diagnose this problem we will need to see for certain what the mongod nodes are spending their time doing. Could you move the sampling rate back up to 100% until we find the root cause?
> What is the best statistic to monitor the cache size and if its large enough?
Your system would ideally have enough RAM allocated to the mongod that only rarely does the process have to go to disk to satisfy queries. The simplest way to determine if your cache is large enough is to watch the iostat output and see if disk utilization remains at a low level. If it is consistently high, you most likely could allocate more RAM (or refine your indexes / schema). Products like Cloud Manager can make monitoring these statistics easier. Thank you, Danny
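As a supplement to the iostat suggestion above, a minimal sketch of checking WiredTiger cache pressure from the shell; these serverStatus fields exist in 4.0, and the interpretation is only a rough rule of thumb:

```javascript
// Compare bytes currently in the WiredTiger cache against the configured maximum,
// and watch how much data is being read back into the cache over time.
var wt = db.serverStatus().wiredTiger.cache;
print("cache used / max bytes : " +
      wt["bytes currently in the cache"] + " / " +
      wt["maximum bytes configured"]);
print("pages read into cache  : " + wt["pages read into cache"]);
```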
| Comment by James Hartig [ 16/Jan/19 ] |
My guess after looking into it more is that because "phil" has ~50 million documents from an old shard range that was transferred away but has not been removed yet, the query is looking at way more documents than it needs to. It's unclear to me why it's choosing the wrong index ("r_1_e_1_ss_1_tsc_1_tslp_1" vs "r_1_u_1_pid_1_oid_1_incr_1"), but that is most likely causing it to look at an incredible number of documents that the SHARDING_FILTER then discards. Any insight into why it's choosing that index instead of the "{ r: 1, u: 1, pid: 1, oid: 1, incr: 1 }" index like it does locally when I run the query? Is it somehow because of the orphaned docs? I'm currently running the range deleter (as you can see in the currentOp I posted), which still has a few days left before all of the old shard docs are removed.
| Comment by James Hartig [ 16/Jan/19 ] |
Hey Daniel, We actually have the slowOp threshold set to 1000ms but sampling at only 20%, so that might explain why you didn't see it. If I look at current operations on "phil" while the mongos query is hanging, I do see it running, but I'm not sure why it would take so much longer when run through mongos versus not. The difference in the resulting explain is that there's a SHARDING_FILTER stage when running via mongos, as well as it using the wrong index as I mentioned in a previous comment. Could either of those be causing the query to take forever? I'm attaching 2 json files to this bug. One is the explain running locally (explain-local.json) and the other is when it is running via mongos (explain-mongos.json). Both are from the query: db.sessions4.find({r: "gce-us-east1", u: {$lt: "V"}}).limit(2).explain(). We are aware of the disk issue right now and plan to raise the cacheSize parameter for WiredTiger once we fix a driver issue. We originally had to lower the cache size because a driver bug was causing an abnormally large number of connections, which caused mongod to crash with an OOM. What is the best statistic to monitor the cache size and whether it's large enough? I'll also upload the diagnostics and logs from "queen" to the portal shortly, as well as the output from db.currentOp(true) while the query is running on "phil" (opid: 641601105), in case that provides any insight into why it's taking so long.
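For reference, a sketch of how the slow-op threshold and sample rate mentioned above could be adjusted on the mongod; this assumes the 20% sampling was configured via the profile command's sampleRate option (available since 3.6):

```javascript
// Keep the profiler off (level 0) but log 100% of operations slower than 1000 ms,
// instead of the 20% sample mentioned above.
db.runCommand({ profile: 0, slowms: 1000, sampleRate: 1.0 });
```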
| Comment by Danny Hatcher (Inactive) [ 16/Jan/19 ] |
Hello James, The mongod information you provided shows that your disk is consistently at 100% utilization because not enough RAM is allocated to the process. This is definitely causing slowness in your system but is not necessarily the direct cause of the length of this query. I do not actually see an instance of the query show up in the mongod logs, which indicates that the query itself is taking less than the 1000ms slowOp threshold you have set. Could you provide the mongod log and diagnostic.data for the "queen" Primary as well so that I can compare? Thank you, Danny
| Comment by James Hartig [ 15/Jan/19 ] |
Thanks! I uploaded 4 files to that portal: the mongod and mongos diagnostic.data folders and the logs from the mongos and mongod. I was running the mongos in verbose logging mode but I didn't have that luxury for the mongod logs. This time the query db.sessions4.find({r: "gce-us-east1", u: {$lt: "V"}}).limit(2).explain() took over 27 minutes.
| Comment by Danny Hatcher (Inactive) [ 14/Jan/19 ] |
Hello James, The diagnostic.data does not contain any sensitive information, but I have created a Secure Upload Portal for you to use. Any information uploaded there will only be accessible by MongoDB Engineers and will be deleted automatically after 180 days. You can tar the "diagnostic.data" folder along with the logfiles and upload them there. In terms of time frame, the window from right before the query through right after it is the needed information, but any extra time you can provide on either side may be helpful. If you perform regular log rotation, the full log file covering that time would be great; if you do not rotate your log files, then perhaps a few hours before and after the query was run. Thank you, Danny
| Comment by James Hartig [ 14/Jan/19 ] |
How long of a time period do you need for the mongod logs? Just from right before the query until right after? What's the best way to submit the diagnostic.data folder? Should I just make a tar of it and post it here as an attachment? Is there sensitive information in it?
| Comment by Danny Hatcher (Inactive) [ 14/Jan/19 ] |
Hello James, Could you also provide the mongod logs for the Primaries of the shards the query is hitting, as well as the "diagnostic.data" folder for the mongods and the mongos running the command? Thank you, Danny
| Comment by James Hartig [ 14/Jan/19 ] |
Additionally, if I do a query that only targets either single shard (the split point happens to be "UxOVavlFZXtKRL5MnB+1uQ=="): db.sessions4.find({r: "gce-us-east1", u: {$lt: "U"}}).limit(2).explain() or db.sessions4.find({r: "gce-us-east1", u: {$gt: "Z"}}).limit(2).explain(), they both return instantly. It's only when the query spans both of the shards that it causes an issue.
| Comment by James Hartig [ 13/Jan/19 ] |
Forgot to mention that the server that mongos spends forever waiting on has over 100 million orphaned documents because I recently added a new shard. Additionally, there is an index "r_1_u_1_pid_1_oid_1_incr_1", but it ends up in the rejected plans in the response when I run the query through mongos; if I run it locally on the server, it uses that index. It should be able to use that index because the query is {r: "gce-us-east1", u: {$lt: "V"}}, right? Why did it decide to use "r_1_e_1_ss_1_tsc_1_tslp_1"?
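A sketch of how the chosen and rejected plans can be compared; the field paths are standard explain output, and the query is the one quoted in this comment:

```javascript
// Run directly on the shard's mongod and compare the winning plan against the
// rejected ones; through mongos the same information is nested per shard under
// the top-level winning plan.
var exp = db.sessions4.find({ r: "gce-us-east1", u: { $lt: "V" } }).limit(2).explain();
printjson(exp.queryPlanner.winningPlan);     // index actually chosen
printjson(exp.queryPlanner.rejectedPlans);   // plans considered and discarded
```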