[SERVER-22611] ChunkManager refresh can occasionally cause a full reload Created: 15/Feb/16 Updated: 08/May/18 Resolved: 12/Mar/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 3.2.12, 3.4.2 |
| Fix Version/s: | 3.4.4, 3.5.5 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yoni Douek | Assignee: | Kaloian Manassiev |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Backport Requested: |
v3.4
|
| Sprint: | Sharding 11 (03/11/16), Sharding 12 (04/01/16), Sharding 13 (04/22/16), Sharding 14 (05/13/16), Sharding 15 (06/03/16), Sharding 2017-03-27 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||||||||||||||||||||||||||||||
| Case: | (copied to CRM) | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||||||||||||||||||||||||||||||||||||||
| Description |
|
A full reload can block other operations because it takes the DBConfig mutex. This happens when the chunk differ gets an unexpected result from the config server (see here). This can potentially occur when a yield occurs while querying the config server. Original Title: Chunk migration freezes all mongos servers for >60 seconds Original description:
|
| Comments |
| Comment by Ramon Fernandez Marina [ 24/Aug/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'username': u'kaloianm', 'name': u'Kaloian Manassiev', 'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 24/Aug/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {'username': u'kaloianm', 'name': u'Kaloian Manassiev', 'email': u'kaloian.manassiev@mongodb.com'}Message:Revert " This reverts commit ae2518adace4ba7ed6a16eba6943bff6ea4ade10. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 12/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: This change gets rid of the "chunk differ" which was previously shared (cherry picked from commit b1fd308ad04a5a6719fe72bcd23b10f1b8266097) | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 12/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Instead of calling its internal logic directly. (cherry picked from commit 84d94351aa308caf2c684b0fe5fbb7f942c75bd0) | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 12/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: (cherry picked from commit 758bc2adcf2c83363d0fdfdef0cbd1cf3c800e62) | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 11/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 11/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: (cherry picked from commit 39e06c9ef8c797ad626956b564ac9ebe295cbaf3) | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Clive Hill [ 06/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Understood - thanks Ramon | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 06/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
The current schedule for 3.4.4 is end of April, but there's no guarantee this bug will be fixed in 3.4.4. Please also note that this is not a commitment on our part and schedules are subject to change. Regards, | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Clive Hill [ 06/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks Ramon, I misread the ticket. I see now it says affects 3.4.2; I thought it was fixed in that version. Are you able to give an estimate as to when 3.4.4 will be available? (Previous releases seem to be around 4 to 6 weeks apart; as 3.4.3 was available on 27th March, do you think mid-May is realistic?) | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 06/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
EvilChill, this issue is scheduled for inclusion in a future 3.4 release, but is unfortunately not available in 3.4.3. If you're watching this ticket, you'll see the "3.4 Required" fixVersion change to 3.4.X (where X >= 4) when the fix lands in the v3.4 branch. Regards, | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Clive Hill [ 06/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
I see this issue with MongoDB 3.4.3 with Java driver 3.4.2. Should this be fixed now? | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 04/Apr/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: This change gets rid of the "chunk differ" which was previously shared | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 31/Mar/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 31/Mar/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: Instead of calling its internal logic directly. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 20/Mar/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Githook User [ 12/Mar/17 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Author: {u'username': u'kaloianm', u'name': u'Kaloian Manassiev', u'email': u'kaloian.manassiev@mongodb.com'}Message: | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 21/Apr/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
The design of the code that updates the ChunkManager (routing table) appears to rely heavily on the assumption that this scenario will be rare in order to keep the code simple and correct. It's going to take a substantial rewrite of that code to remove that assumption, which means a higher risk of introducing undetected correctness errors. I'm looking into design alternatives, but it may not be advisable to backport a change of this magnitude to the 3.2 release branch. I'll also check around to see if other users are experiencing similar problems. In any event, I'll keep this ticket up to date. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 19/Apr/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Any news? | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 04/Apr/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Update: I was able to write a repro script that demonstrates the bug where a full reload happens more than once for a single request. Explanation of how mongos reads the full config.chunks for a collection more than once: 1. A new ChunkManager gets instantiated, with _version initialized from config.collections (ref) Now that the collection info has a zero shard version, mongos will send a zero version when it talks to the shard. The shard will reject it and return the desired shard version. Since the epoch will never match, mongos will perform another full reload (and risk hitting the same bug again). Conditions for this to happen:
Attached double_reload.js demonstrating this bug. | ||||||||||||||||||||||||||||||||||||||||||||||
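The sequence described in this comment can be illustrated with a small toy model. This is only a sketch under stated assumptions, not code from the MongoDB source or from double_reload.js: the names `ZERO_VERSION`, `shardAccepts`, and `fullReloadsForRequest` are hypothetical, and a version is reduced to an epoch plus two counters. It shows why a router that ends up with a zero version needs a second full reload before a single request can succeed.

```javascript
// Toy model only; all names are hypothetical and not taken from the MongoDB source.
// A version is { epoch, major, minor }; the "zero" version carries no usable epoch.
const ZERO_VERSION = { epoch: null, major: 0, minor: 0 };

// The shard accepts a request only if the router's epoch matches its own.
function shardAccepts(routerVersion, shardVersion) {
  return routerVersion.epoch !== null && routerVersion.epoch === shardVersion.epoch;
}

// Counts the full reloads needed to serve one request.
// `loadedVersions` lists what each successive full reload leaves in the router.
function fullReloadsForRequest(loadedVersions, shardVersion) {
  let reloads = 0;
  for (const loaded of loadedVersions) {
    reloads += 1; // each attempt reads all of config.chunks for the collection
    if (shardAccepts(loaded, shardVersion)) {
      return reloads;
    }
    // Rejected: the shard reports the version it wants and the router reloads again.
  }
  return reloads;
}

const shardVersion = { epoch: "e1", major: 5, minor: 0 };
// The first reload hits the bug and leaves a zero version; the second one succeeds.
const reloads = fullReloadsForRequest([ZERO_VERSION, shardVersion], shardVersion);
console.log(reloads);
```

In this model the cost of the bug is the extra pass over config.chunks; the real conditions that trigger it are the ones listed in the comment and are not reproduced here.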
| Comment by Randolph Tan [ 31/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
I would like to share some updates. I was trying to write a more focused test script and was not able to reproduce the problem. After digging more, I found out that my initial investigation was incorrect. The original test script attached to this ticket did not actually reproduce the problem; it only exhibited similar symptoms but was never actually doing a full reload because of a ChunkManager refresh. I also realized that the current code already handles the case where the chunk differ gets overlapping chunks because of yielding:
We are going to re-examine this ticket and find the real cause. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Andy Schwerin [ 26/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
We know the sequence of events that cause the problem, as Randolph | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 26/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Any news? | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 17/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, It looks like you have the right indexes. Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 17/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
| ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 16/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi Yoni, Is it possible to post the list of indexes in the config.chunks collection? This can be done by connecting to mongos using the shell and issuing this series of commands:
Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||
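The commands from the original comment were not preserved in this export. As a hedged sketch of one way to list those indexes from the mongo shell, the helper below wraps the shell's `getSiblingDB` and `getIndexes` calls; the function name `listChunkIndexes` is hypothetical, and `db` is assumed to be the database handle that the shell provides as a global when connected to a mongos.

```javascript
// Returns the index definitions of the config.chunks collection.
// `db` is assumed to be the handle the mongo shell exposes as a global;
// the function name is hypothetical and used only for illustration.
function listChunkIndexes(db) {
  return db.getSiblingDB("config").chunks.getIndexes();
}

// In the mongo shell, connected to a mongos:
//   printjson(listChunkIndexes(db));
```

Passing the handle in as a parameter, rather than reading the global directly, keeps the helper usable from a script that opens its own connection.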
| Comment by Randolph Tan [ 16/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
A more detailed explanation of this bug: 1. Mongos gets a stale config exception and decides that it needs to update its internal view of the routing table. We are currently discussing options for fixing this issue and I'll update the ticket again once we have decided what to do. | ||||||||||||||||||||||||||||||||||||||||||||||
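The remaining steps of this explanation are missing from the export. As a rough illustration of the general pattern the ticket describes (an incremental refresh that falls back to a full reload when the differ sees something unexpected), here is a toy model. Everything in it is an assumption made for illustration: the names `applyDiff` and `refresh`, the use of plain numbers as versions, and the choice of an epoch mismatch as the "unexpected result". It is not the actual ChunkManager logic.

```javascript
// Toy model only; function and field names are hypothetical.
// A chunk is { min, max, version, epoch }; versions are plain numbers here.

// Incremental step: apply only the chunks that changed since the cached version.
// Returns false when the result cannot be trusted and a full reload is needed.
function applyDiff(cache, changedChunks) {
  for (const chunk of changedChunks) {
    if (chunk.epoch !== cache.epoch) {
      return false; // assumed trigger: the diff does not belong to the cached epoch
    }
    cache.chunks.set(chunk.min, chunk);
    cache.version = Math.max(cache.version, chunk.version);
  }
  return true;
}

function refresh(cache, configChunks) {
  const changed = configChunks.filter((c) => c.version > cache.version);
  if (applyDiff(cache, changed)) {
    return "incremental";
  }
  // Fallback: discard the cache and read every chunk again (the slow path).
  cache.chunks = new Map(configChunks.map((c) => [c.min, c]));
  cache.epoch = configChunks[0].epoch;
  cache.version = Math.max(...configChunks.map((c) => c.version));
  return "full";
}

const cache = { epoch: "e1", version: 2, chunks: new Map() };
const config = [
  { min: 0, max: 10, version: 3, epoch: "e1" },
  { min: 10, max: 20, version: 2, epoch: "e1" },
];
console.log(refresh(cache, config));
```

The point of the sketch is the cost asymmetry: the incremental path touches only the changed chunks, while the fallback reads the whole collection, which is the expensive path this ticket is about.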
| Comment by Yoni Douek [ 16/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Great. mmapv1 for the entire cluster, including everything. I mentioned this in the bug description. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 15/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Attached repro script test.js + diff file for failpoint. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 15/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, I was able to reproduce the issue and I have also confirmed that this is also an issue in v3.0. I am trying to figure out why it happens more often on your v3.2 setup. Were you using the WiredTiger storage engine on the config server for both your v3.0 and v3.2 cluster? Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 15/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
If mongodump is still necessary, let me know, but we'll need to find a private way to share it. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 15/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Thanks. cfg x 3 attached. They are mirrored. There was a single additional mongos (2 total), attached as well. We are 99% confident this bug started happening in 3.2.x and not in 3.0.x. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 14/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, I noticed in the logs that both mongos and the shards try to load the full chunk metadata for mydomain.Sessions because some mongos sent a find request with a zero shard version. The load in the shard took about 6 seconds, which roughly coincides with the hang period you mentioned. I am still trying to figure out how this can happen, but in the meantime, is it possible to upload the logs for the 3 config servers and the dump (using mongodump)? And if there are not a lot of mongos, is it possible to upload the logs for all mongos as well? Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 13/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Logs attached. To clarify: it doesn't happen only when the primary is demoted; it happens all the time when the balancer is active. Demotion is a different bug, please read above. You can see the "based on empty" in the mongos log attached. All running 3.2.4. mongos froze for 6-11 seconds at the following times: when it happens, it's usually such a "cluster" of freeze times. Files: You should have everything you need. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Randolph Tan [ 10/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi, Is it possible to upload one of the logs (not just snippets) for the mongos and the primary of the shard that got demoted when this happened? Thanks! | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 10/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Looks like it still happens in 3.2.4, but maybe for shorter periods (so far we got it to freeze "only" for 6 seconds). How can we help you help us? : ) Reminder: our setup is 3 shards, each one P+S+A, 3 config servers, 2 mongos servers. The mongos servers are the component that freezes, so please be specific about the diagnostic data that you need. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 09/Mar/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Looks like this behavior is still present in 3.2.3, so we'll continue to investigate. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Kaloian Manassiev [ 19/Feb/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
Hi yoni@appsee.com, Thank you for the detailed report and apologies for the problems this issue is causing you. From the symptoms that you are experiencing (only mongos gets stuck) and also from the fact that you have seen this behaviour when the primary steps down, I have a strong suspicion that you might be seeing a manifestation of The fix for this bug is available in v3.2.3. If you haven't already, would it be possible for you to upgrade to this version and see whether the problem goes away? Best regards, | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 18/Feb/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
The bug described in the last sentence happened also in 3.0.2. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Yoni Douek [ 18/Feb/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
3.2, mirrored. Any workaround will be appreciated, we currently can't move any chunks. | ||||||||||||||||||||||||||||||||||||||||||||||
| Comment by Ramon Fernandez Marina [ 15/Feb/16 ] | ||||||||||||||||||||||||||||||||||||||||||||||
|
yonido, can you please clarify which version(s) of MongoDB this cluster is running? If it's 3.2, are your config servers using a mirrored configuration or a replica set? |