[SERVER-29603] Dropping indexes sometimes locks our entire cluster Created: 13/Jun/17 Updated: 24/Aug/17 Resolved: 16/Jul/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.13, 3.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Scott Glajch | Assignee: | Kelsey Schubert |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | 3.4.4 sharded cluster with 18 shards, each a replica set of 1 primary, 1 secondary, and 1 hidden replica; 3 config servers (CSRS) and 5 mongos |
| Attachments: |
| Operating System: | ALL |
| Steps To Reproduce: |
| Participants: |
| Description |
We have run into a situation where we have the same collection on many shards, but the definition of one of our indexes on this collection varies from shard to shard: on some shards it is sparse and on others it is not. We are trying to remedy this by dropping the index entirely and recreating it in the background (as a side note, why on earth are indexes built in the foreground by default?). I will try to reproduce this locally without the full production environment to see whether this problem is tied to the indexes not matching between shards.
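One way to confirm which shards carry the sparse variant of the index is to query each shard's primary directly and compare the index specs. A minimal pymongo sketch, where the shard hosts, database, collection, and index name are hypothetical placeholders:

```python
# Minimal sketch (not from the ticket): the shard hosts, database, collection,
# and index name below are hypothetical placeholders.
from pymongo import MongoClient

SHARD_PRIMARIES = ["shard01.example.net:27018", "shard12.example.net:27018"]
DB, COLL, INDEX_NAME = "appdb", "events", "customerId_1"

for host in SHARD_PRIMARIES:
    client = MongoClient(host)
    info = client[DB][COLL].index_information()   # dict keyed by index name
    spec = info.get(INDEX_NAME)
    # 'sparse' only appears in the spec when the index was built sparse
    if spec is None:
        print(host, "index missing")
    elif spec.get("sparse"):
        print(host, "sparse")
    else:
        print(host, "not sparse")
    client.close()
```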
| Comments |
| Comment by Ramon Fernandez Marina [ 16/Jul/17 ] |
glajchs, we haven't heard back from you for some time, so I'm resolving this ticket. If this is still an issue for you, please provide the information requested above by Thomas so we can investigate further. Regards,
| Comment by Kelsey Schubert [ 29/Jun/17 ] |
Hi glajchs, We still need additional information to diagnose the problem. If this is still an issue for you, would you please upload the diagnostic.data to the portal I've provided? Thank you,
| Comment by Kelsey Schubert [ 14/Jun/17 ] |
Hi glajchs, Thank you for taking the time to provide the mongod logs. Please note that diagnostic.data does not contain any customer information.
Since you've expressed concern about privacy, I've created a secure upload portal for you to provide files going forward. Files uploaded to this portal will only be visible to MongoDB employees investigating the issue and are routinely deleted after some time. Would you please upload the diagnostic.data so we can continue to debug this behavior? Thank you,
| Comment by Scott Glajch [ 14/Jun/17 ] |
Hi Thomas, thanks for your response. It took me a little while to get the log files; basically we wanted to obscure customer data in the form of database names as well as actual row identifiers.
We've had 2 events in production where this has happened. The first event happened when we were on 3.2.13 (so you might want to update the affected versions bug attribute). The second event happened a few hours after we finished our 3.4.4 upgrade; we were hoping this problem would go away with the 3.4.4 upgrade. The attached logs are from the 1st event. Sadly, only after anonymizing the logs did I realize they were from the 1st event, and I didn't want to do that work all over again. However, the format of the logs seems to be essentially the same during both events.
Basically we had a list of collections which we knew were in the bad state (some of the shards had this index with the sparse flag and some without). So we were going to go through them, dropping the old index, waiting 2 seconds, recreating the new index (with background: true), waiting 2 seconds, and moving on to the next collection. The 2nd collection's drop index is what hung the server.
In both cases, only one of our shards went to 100% CPU for the duration of the problematic dropIndex (and only during the problematic dropIndex). The first time (log attached), it was during the 2nd collection's drop index and the problem happened on shard12. The second time, it was during the 7th collection's drop index and the problem happened on shard10. In both cases only the primary's CPU spiked.
Unfortunately I'm pretty sure we won't be able to attach the diagnostic.data file, as I think it contains customer information. I'll keep researching the format of this file to see if there's a way I can provide it. Also, it seems that these diagnostic.data files are rotated and deleted after 1 week, so I only have the files for the 2nd event. I have saved them off in case I can figure out a way to make it safe to share. I'm also hoping to be able to reproduce with test data locally, but I'm not sure I'll be able to get this to happen.
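For reference, the remediation loop described in this comment (drop the old index, wait 2 seconds, recreate it with background: true, wait 2 seconds, move on) could be sketched roughly as follows; the mongos address, collection names, and index key are hypothetical placeholders, not taken from the ticket:

```python
# Rough sketch of the drop/recreate loop described above; collection names,
# the index key, and the mongos address are hypothetical placeholders.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://mongos.example.net:27017")
db = client["appdb"]

known_bad_collections = ["coll_a", "coll_b", "coll_c"]
for name in known_bad_collections:
    coll = db[name]
    coll.drop_index("customerId_1")                   # the step that hung one shard
    time.sleep(2)
    coll.create_index("customerId", background=True)  # rebuild without the sparse flag
    time.sleep(2)
```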
| Comment by Kelsey Schubert [ 13/Jun/17 ] |
Hi glajchs, Thank you for the report. Would you please upload the diagnostic.data and mongod logs for an affected node, so we can investigate? I do not expect that this is related to the indexes not matching between shards, and the diagnostic.data and logs will likely allow us to determine the cause of this behavior. Kind regards, |