[SERVER-29603] Dropping indexes sometimes locks our entire cluster Created: 13/Jun/17  Updated: 24/Aug/17  Resolved: 16/Jul/17

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.13, 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Scott Glajch Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

3.4.4 sharded cluster with 18 shards, each a replica set consisting of 1 primary, 1 secondary replica, and 1 hidden replica; 3 config servers (CSRS) and 5 mongos routers


Attachments: File shard.log_trimmed    
Operating System: ALL
Steps To Reproduce:
  1. Drop indexes
  2. See if reads/writes are blocked (a shell sketch of this check is below)
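A minimal mongo-shell sketch of the check in step 2, assuming a placeholder namespace mydb.mycoll and the index name from the description (not the exact commands that were run):

// Session 1: drop the index (database/collection names are placeholders).
db.getSiblingDB("mydb").mycoll.dropIndex("eIds.tId_1_eIds.v_-1");

// Session 2: while the drop is running, list operations stuck waiting on a lock.
db.currentOp(true).inprog
    .filter(function (op) { return op.waitingForLock; })
    .forEach(printjson);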
Participants:

 Description   

We have run into a situation where the same collection exists on many shards, but the definition of one of the indexes on this collection varies from shard to shard: on some shards it is sparse and on others it is not.
While this was our fault, this state blocks these collections from being balanced, as the balancer throws this error:
"failed to create index before migrating data. error: IndexOptionsConflict: Index with name: eIds.tId_1_eIds.v_-1 already exists with different options"

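For illustration, this is roughly how the mismatch can be confirmed from the shell; the host, database, and collection names below are placeholders (shard hosts come from sh.status()), and the index key is inferred from the index name:

// Run once per shard, against that shard's primary (host name is a placeholder).
var conn = new Mongo("shard12-primary.example.net:27017");
conn.getDB("mydb").mycoll.getIndexes()
    .filter(function (idx) { return idx.name === "eIds.tId_1_eIds.v_-1"; })
    .forEach(printjson);   // compare the "sparse" option across shards
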
We are trying to remedy this by dropping the index entirely and re-creating it in the background (as a side note, why on earth are indexes built in the foreground by default?).
This does work; however, we have to repair something like 100 collections, and twice now, within the first 5 drops, one of these dropIndex calls has completely locked up our system. A rough sketch of the per-collection fix is below.
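
Roughly, the per-collection fix looks like the following; the database and collection names are placeholders, and the index key is inferred from the index name:

// Placeholder names; issued through a mongos so the change reaches every shard.
var coll = db.getSiblingDB("mydb").mycoll;
coll.dropIndex("eIds.tId_1_eIds.v_-1");
coll.createIndex({ "eIds.tId": 1, "eIds.v": -1 },
                 { background: true });   // plus whichever other index options we standardize on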

I will try to reproduce this locally without the full production environment in an effort to see if this problem is tied to the indexes not matching between shards.



 Comments   
Comment by Ramon Fernandez Marina [ 16/Jul/17 ]

glajchs, we haven't heard back from you for some time, so I'm resolving this ticket. If this is still an issue for you, please provide the information requested above by Thomas so we can investigate further.

Regards,
Ramón.

Comment by Kelsey Schubert [ 29/Jun/17 ]

Hi glajchs,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please upload the diagnostic.data to the portal I've provided?

Thank you,
Thomas

Comment by Kelsey Schubert [ 14/Jun/17 ]

Hi glajchs,

Thank you for taking the time to provide the mongod logs. Please note that diagnostic.data does not contain any customer information, and you're welcome to review the source code to see what is collected. Essentially, it is the output of the following commands, plus some system metrics:

serverStatus: db.serverStatus({tcmalloc: true})
replSetGetStatus: rs.status()
collStats for local.oplog.rs: db.getSiblingDB('local').oplog.rs.stats()
getCmdLineOpts: db.adminCommand({getCmdLineOpts: true})
buildInfo: db.adminCommand({buildInfo: true})
hostInfo: db.adminCommand({hostInfo: true})

Since you've expressed concern about privacy, I've created a secure upload portal for you to provide files going forward. Files uploaded to this portal will only be visible to MongoDB employees investigating the issue and are routinely deleted after some time.

Would you please upload the diagnostic.data so we can continue to debug this behavior?

Thank you,
Thomas

Comment by Scott Glajch [ 14/Jun/17 ]

Hi Thomas, thanks for your response. It took me a little while to get the log files; basically we wanted to obscure customer data such as database names as well as actual row identifiers.

We've had 2 events in production where this has happened. The first event happened while we were on 3.2.13 (so you may want to update the Affects Version/s field on this ticket). The second event happened a few hours after we finished our 3.4.4 upgrade; we had been hoping the problem would be gone after moving to 3.4.4. The attached logs are from the 1st event. Sadly, I only realized they were from the 1st event after I had finished anonymizing them, and I didn't want to do that work all over again; however, the format of the logs seems essentially the same during both events.

Basically we had a list of collections that we knew were in the bad state (some of the shards had this index with the sparse flag and some without). We were going to go through them one at a time: drop the old index, wait 2 seconds, recreate the new index (with background: true), wait 2 seconds, and move on to the next collection (essentially the script sketched below). The 2nd collection's dropIndex is what hung the server.
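
The repair loop was essentially the following; the database and collection names are placeholders (the real list has roughly 100 entries), and the index key is inferred from the index name:

// Placeholder database/collection names; run through a mongos. sleep() takes milliseconds.
var collections = ["collA", "collB", "collC"];
var mydb = db.getSiblingDB("mydb");
collections.forEach(function (name) {
    mydb.getCollection(name).dropIndex("eIds.tId_1_eIds.v_-1");
    sleep(2000);   // wait 2 seconds
    mydb.getCollection(name).createIndex({ "eIds.tId": 1, "eIds.v": -1 },
                                         { background: true });
    sleep(2000);   // wait 2 seconds before moving on to the next collection
});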

In both cases, only one of our shards went to 100% CPU for the duration of the problematic dropIndex (and only during the problematic dropIndex). The first time (log attached), it was during the 2nd collection's dropIndex and the problem happened on shard12. The second time, it was during the 7th collection's dropIndex and the problem happened on shard10. In both cases only the primary's CPU spiked.

Unfortunately I'm pretty sure we won't be able to attach the diagnostic.data file, as I think it contains customer information. I'll keep researching the format of this file to see if there's a way I can provide it. Also, it seems that these diagnostic.data files are rotated and deleted after 1 week, so I only have the files for the 2nd event. I have saved them off in case I can figure out a way to make it safe to share.

I'm also hoping to be able to reproduce with test data locally, but I'm not sure I'll be able to get this to happen.

Comment by Kelsey Schubert [ 13/Jun/17 ]

Hi glajchs,

Thank you for the report. Would you please upload the diagnostic.data and mongod logs for an affected node, so we can investigate?

I do not expect that this is related to the indexes not matching between shards, and the diagnostic.data and logs will likely allow us to determine the cause of this behavior.

Kind regards,
Thomas
