[SERVER-16348] Assertion failure n >= 0 && n < static_cast<int>(_files.size()) src/mongo/db/storage/extent_manager.cpp 109 Created: 28/Nov/14 Updated: 07/Apr/23 Resolved: 18/May/15
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance |
| Affects Version/s: | 2.6.9 |
| Fix Version/s: | 2.6.11 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Glen Miner | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Backwards Compatibility: | Minor Change |
| Steps To Reproduce: | If I had to guess, this might be two different clients doing a background ensureIndex on the same collection. We updated to 2.6.6 from 2.4 a few weeks ago – the code in question has been running faithfully for about a year. We tend to do this a lot (create transient collections and delete them weekly, with lazy ensureIndex on demand), so I'm terrified about this causing stability problems. (A sketch of this pattern follows, just below these fields.) |
| Sprint: | RPL 4 06/05/15 |
| Participants: | |
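A minimal mongo-shell sketch of the usage pattern guessed at in Steps To Reproduce above: two clients issuing a background ensureIndex on the same transient collection at roughly the same time. The database, collection, and field names here are illustrative only, not taken from the reporter's deployment.

    // Client A and Client B, connected separately, each run something like:
    var coll = db.getSiblingDB("gamedb").getCollection("weekly.scores");  // hypothetical names
    coll.ensureIndex({ score: -1 }, { background: true });

    // Elsewhere, the transient collection is dropped and recreated weekly,
    // and the index is re-created lazily ("on demand") the next time it is needed.
    coll.drop();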
| Description |
We just had 2/3 of the servers in our cluster crash and then continue to crash when restarted. The log says:
| Comments |
| Comment by Eric Milkie [ 18/May/15 ] |
With this fix, renaming a collection with concurrent background indexes is no longer possible (it will return an error).
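As a rough illustration of the post-fix behavior described above; the namespaces are hypothetical and the exact error code/message is not specified here.

    // Session 1: start a background index build on the source collection
    db.getSiblingDB("test").getCollection("src").ensureIndex({ score: -1 }, { background: true });

    // Session 2: while that build is still running, attempt the rename
    db.adminCommand({ renameCollection: "test.src", to: "test.dst", dropTarget: true });
    // With the fix, this no longer proceeds; the command returns an error
    // instead of racing with the in-progress background index build.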
| Comment by Githook User [ 18/May/15 ] |
Author: Eric Milkie (milkie) <milkie@10gen.com>
Message:
| Comment by Githook User [ 02/Mar/15 ] |
Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>
Message: Revert " This reverts commit 221e9a82b87e4f3297b4b057820c90820bf0d009.
| Comment by Githook User [ 02/Mar/15 ] |
Author: Kaloian Manassiev (kaloianm) <kaloian.manassiev@mongodb.com>
Message:
| Comment by Eric Milkie [ 11/Feb/15 ] |
The crash is different on 3.1.0-pre:
| Comment by Glen Miner [ 02/Dec/14 ] |
I can confirm we are doing rename, and background indexing, and queries would be hitting the rename target the whole while. It seems like this is likely the same problem.
| Comment by Glen Miner [ 02/Dec/14 ] |
The full database is about 13GB – I may have problems with permission, though, since some of the data may be sensitive.
| Comment by Asya Kamsky [ 01/Dec/14 ] |
I realized this might be a variant of this bug: https://jira.mongodb.org/browse/SERVER-11716
You are using renameCollection. When the collection has background indexes, the rename is not atomic the way it would be if the indexes had been built in the foreground, so if other processes were doing something else to this collection at the same time, that could cause a bad interaction. I'd like to confirm that this is the case before marking this as a duplicate of that bug (and adding the details over there).
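A minimal sketch of the interaction being described, assuming three concurrent clients against a 2.6.x node; the namespaces and field names are illustrative only.

    // Client 1: build an index on the staging collection in the background
    db.getSiblingDB("stats").getCollection("temp.scores")
      .ensureIndex({ score: -1 }, { background: true });

    // Client 2: rename the staging collection over the live one, possibly before
    // the background build has finished (the rename is not atomic in this case)
    db.adminCommand({
        renameCollection: "stats.temp.scores",
        to: "stats.archived.scores",
        dropTarget: true
    });

    // Client 3: queries hitting the rename target the whole while
    db.getSiblingDB("stats").getCollection("archived.scores")
      .find().sort({ score: -1 }).limit(10);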
| Comment by Asya Kamsky [ 01/Dec/14 ] |
Definitely we would need the actual malfunctioning DB files - i.e. what's in the dbpath directory. How large are they?
| Comment by Glen Miner [ 01/Dec/14 ] |
I've since resurrected the cluster and rewritten the script so that we don't use temp – the indexes are the same, though (we rename db.temp.* to db.archived.* when finished). The data is extremely tame, so I don't see why I can't send it to you – I just mongoexported the two collections involved and it's just over 10MB after gz. I can SFTP or attach here by private comment – whatever is easiest for you. I'm not sure if there's some nuance lost in export, so maybe a dump or full snapshot is required – either way I can try to help set you up with a repro.
All documents in these collections are uniform and look like this
(it's clan leaderboard data for a game – score, name and rank).
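For context, a hypothetical document of the general shape described (clan leaderboard data: score, name and rank); the field names and values below are invented for illustration and are not taken from the actual export.

    {
        "name" : "ExampleClan",   // hypothetical clan name
        "score" : 12345,          // hypothetical score
        "rank" : 7                // hypothetical rank
    }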
| Comment by Asya Kamsky [ 30/Nov/14 ] |
Would it be possible to provide the output of db.xxx.getIndexes() from the affected collection? One thing we could do: if you can share the data files from one of the nodes that is crashing/cannot restart (maybe from the test cluster?), we can provide you a secure site to scp the data to. It might significantly speed up triaging the problem.
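For reference, the index definitions being asked for can be pulled with something like the following; the namespace is taken from a later comment in this thread and may not match the collection that actually crashed.

    // From the mongo shell, against the affected collection:
    db.getSiblingDB("lotus_stats").getCollection("temp.InfestedEventScoreT1").getIndexes();
    // Returns an array of index specs (the default _id index plus any secondary indexes).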
| Comment by Glen Miner [ 29/Nov/14 ] |
I copied some of the data involved to a test cluster and crashed it too. I have rewritten things to avoid using a temp collection – it isn't as atomic sadly but it's better than crashing!
| Comment by Glen Miner [ 29/Nov/14 ] |
New theory: a cron job was smashing the collection in question throughout the long replication; we periodically generated a replacement collection (lotus_stats.temp.InfestedEventScoreT1), indexed it, and then smashed it over the live collection.
I'm guessing this probably happened at least a dozen times during the replication. I've disabled this cron and am mongodump'ing / remove()ing a bunch of data I can do without to speed up replication, so hopefully the next test cycle is only a few hrs.
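A sketch of the cron job's rebuild-and-swap cycle as described in this thread. Only the lotus_stats.temp.InfestedEventScoreT1 namespace and the db.temp.* -> db.archived.* rename come from the comments; the target name, index key, and insert step are guesses for illustration.

    var statsDb = db.getSiblingDB("lotus_stats");
    var temp = statsDb.getCollection("temp.InfestedEventScoreT1");

    // 1. Regenerate the replacement collection
    temp.drop();
    // ... bulk-insert the freshly computed leaderboard documents here ...

    // 2. Index it (background builds were in use, per the earlier comments)
    temp.ensureIndex({ score: -1 }, { background: true });

    // 3. "Smash over" the live collection by renaming temp into its place
    db.adminCommand({
        renameCollection: "lotus_stats.temp.InfestedEventScoreT1",
        to: "lotus_stats.archived.InfestedEventScoreT1",   // live namespace is a guess
        dropTarget: true
    });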
| Comment by Glen Miner [ 28/Nov/14 ] |
We just resync'd one of the replicas from scratch for 4 hours and it died again.