[SERVER-9904] Secondaries crash when too many indexes Created: 11/Jun/13 Updated: 10/Dec/14 Resolved: 26/Jun/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | charity majors | Assignee: | Scott Hernandez (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
I upgraded the primary and one secondary on my four node + two arbiter cluster (one node nonvoting) yesterday, from 2.2.3 to 2.4.4. This morning all of my secondaries crashed with this error:
We've seen the "add index fails, too many indexes" error many times before on our 2.2 primaries, and it has never caused a problem. Is this a bug that was introduced in 2.4? We dynamically generate indexes, so this will make it effectively impossible for us to upgrade to 2.4 until it's fixed.
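For context, the error comes from mongod's hard cap of 64 indexes per collection. A minimal sketch of a client-side guard for a dynamic index generator, assuming pymongo; the helper name and host below are illustrative, not from this ticket:

```python
# Minimal sketch, assuming pymongo; guarded_create_index and the host are
# illustrative, not part of this ticket.
from pymongo import MongoClient

MAX_INDEXES = 64  # hard per-collection limit enforced by mongod

def guarded_create_index(coll, keys, **opts):
    existing = coll.index_information()  # {index name: {'key': [...], ...}}
    # Skip the build if an index with the same key pattern already exists.
    for info in existing.values():
        if list(info["key"]) == list(keys):
            return None
    # Refuse to send a build that the server would reject (or that would
    # fassert a secondary) because the collection is already at the cap.
    if len(existing) >= MAX_INDEXES:
        raise RuntimeError("collection %s already has %d indexes"
                           % (coll.full_name, len(existing)))
    return coll.create_index(list(keys), **opts)

client = MongoClient("mongodb://primary.example:27017")  # placeholder host
guarded_create_index(client["appdata5"]["UserWrapper"], [("email", 1)])
```

Note that this check-then-create is racy across concurrent clients, which is exactly the concurrency issue discussed in the comments below.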
| Comments |
| Comment by Daniel Pasette (Inactive) [ 26/Jun/13 ] |
marking as duplicate of
| Comment by charity majors [ 13/Jun/13 ] |
Ah!! Fantastic. Thank you so much.
| Comment by Scott Hernandez (Inactive) [ 13/Jun/13 ] |
I believe the underlying bug is related to creating multiple background indexes at a time; it has been fixed and the fix will be in the 2.4.5 release:
If you can make sure you don't create the same (background) indexes concurrently, you will be able to avoid this problem in the current 2.4 releases.
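One way to follow that advice in application code, sketched with pymongo; the in-flight registry below is illustrative and only covers a single process (a multi-process deployment would need a shared lock instead):

```python
# Sketch: suppress concurrent duplicate (background) index builds issued from
# one process. Names here are illustrative, not from the ticket.
import threading
from pymongo import MongoClient

_inflight = set()
_inflight_lock = threading.Lock()

def create_index_once(coll, keys):
    token = (coll.full_name, tuple(keys))
    with _inflight_lock:
        if token in _inflight:
            return None  # identical build already running; don't re-issue
        _inflight.add(token)
    try:
        # 'background' was the relevant option on 2.4-era servers; modern
        # servers ignore it.
        return coll.create_index(list(keys), background=True)
    finally:
        with _inflight_lock:
            _inflight.discard(token)

client = MongoClient("mongodb://primary.example:27017")  # placeholder host
create_index_once(client["appdata5"]["UserWrapper"], [("email", 1)])
```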
| Comment by charity majors [ 13/Jun/13 ] |
I see what you're saying. Yes, after scanning all my nodes, the secondaries that are reporting "too many indexes" in the logs are all former primaries. Looking at the collection that generated this error, I see 64 indexes on the primary:
And only 47 indexes on the secondary.
That makes even less sense. Strange. The index that errored on the secondary:
We don't enforce that only one ensureIndex can be running at a time. According to the mongo docs for 2.2, only one background index build per database can run at a time, so we let mongo enforce that. (I think that was relaxed to one index per collection in 2.4?)
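One way to produce that per-node comparison, sketched with pymongo; the hostnames and namespace are placeholders, and directConnection assumes a reasonably recent driver:

```python
# Sketch: diff index names for one collection between two specific nodes.
# Hosts and namespace are placeholders.
from pymongo import MongoClient

def index_names(host, db, coll):
    client = MongoClient(host, directConnection=True,
                         readPreference="secondaryPreferred")
    return set(client[db][coll].index_information())

primary   = index_names("primary.example:27017",   "appdata5", "UserWrapper")
secondary = index_names("secondary.example:27017", "appdata5", "UserWrapper")

print("only on primary:  ", sorted(primary - secondary))
print("only on secondary:", sorted(secondary - primary))
```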
| Comment by Scott Hernandez (Inactive) [ 12/Jun/13 ] |
I created a test with a replica set where I manually created one more index on each replica (not the primary) and then created the 64th index on the primary. In all cases the replica shut down, as you observed, because it could not create an additional index. I used 2.2.3 and 2.4.4 primaries with both versions as secondaries, and in all configurations it behaved as expected. Please upload the full logs, including the startup stanza, of any node which has not shut down after hitting a replication error creating an index while already at the limit. Also, please note that the logs you provided in your last message are from the primary, not from a replica (via replication):
This also agrees with your statement earlier that:
And as such not replicated. Anything with a [conXXXXXX] after the date in the logs is a user connection, not replication related; replicated operations instead start like this: [repl writer worker XX]
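A small sketch for triaging a log along exactly that line, assuming Python and a placeholder log path:

```python
# Sketch: separate client-issued index failures ([connNNN]) from ones applied
# via replication ([repl writer worker NN]). Log path is a placeholder.
import re

CONN = re.compile(r"\[conn\d+\]")
REPL = re.compile(r"\[repl writer worker \d+\]")

with open("/var/log/mongodb/mongod.log") as log:
    for line in log:
        if "too many indexes" not in line:
            continue
        if REPL.search(line):
            print("replicated:", line.rstrip())
        elif CONN.search(line):
            print("client:    ", line.rstrip())
```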
| Comment by Scott Hernandez (Inactive) [ 12/Jun/13 ] |
The primary and secondary should have the same indexes, so there should not be a reason why the secondary gets an index-creation oplog entry which causes it to have too many indexes: the primary would have reached that state first and would have failed to create (and replicate) the new index.

Now, in the case where a replica gets a replicated command to create an index which it cannot create (because it would result in too many indexes), it will shut down with a stack trace, as you see in the logs. This is not a "crash" but an fassert (fatal assertion). It is an unrecoverable (replication) error – at least without human/manual intervention – but you can remove the extra indexes which caused replication to fail and the server to shut down. I'm not suggesting that this is a normal condition or that having replicas shut down is a good thing, but in the case of an unrecoverable error this is the "correct" thing, and expected.

Now, if you can provide the indexes on the primary and one of these shut-down replicas, we can see what is going on. The real question is how the primary can have different, and fewer, indexes than your other replicas. Also, how do you make sure that you don't issue more than one ensureIndex command at the same time for the same index?
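A sketch of that manual cleanup, assuming pymongo, placeholder hostnames, and that the shut-down replica has been restarted as a standalone (without --replSet) so its extra indexes can be dropped before it rejoins the set:

```python
# Sketch: drop indexes the ex-secondary has but the primary lacks, so the
# replicated index creation can apply once it rejoins. Placeholder hosts.
from pymongo import MongoClient

DB, COLL = "appdata5", "UserWrapper"  # placeholder namespace

primary = MongoClient("primary.example:27017", directConnection=True)
on_primary = set(primary[DB][COLL].index_information())

standalone = MongoClient("ex-secondary.example:27017")  # restarted w/o --replSet
coll = standalone[DB][COLL]
for name in set(coll.index_information()) - on_primary:
    if name == "_id_":
        continue  # never drop the mandatory _id index
    print("dropping extra index:", name)
    coll.drop_index(name)
```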
| Comment by charity majors [ 12/Jun/13 ] |
Wait, are you saying this is supposed to crash? We have seen thousands of these errors, and they have never crashed a secondary before.

Thu May 23 06:45:25 [conn16856606] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key: { email: 1 }
Thu May 23 06:47:45 [conn16856719] add index fails, too many indexes for appdata18.app_f1e23882-768d-4652-84c5-e1fe0c9edb2e:Activity key: { _updated_at: 1 }
end connection 10.38.86.217:47168 (5429 connections now open)
Thu May 23 06:49:20 [conn16856807] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserW
end connection 10.169.17.85:44117 (5429 connections now open)
end connection 10.169.17.85:44118 (5428 connections now open)

It's always been handled correctly when the oplog came from a 2.2 primary. It's only now that we have a 2.4 primary that it is crashing all our secondaries, both 2.2 and 2.4. Yes, we do dynamically create background indexes, but not the same one multiple times.
| Comment by Scott Hernandez (Inactive) [ 11/Jun/13 ] |
How many indexes do you have, and what are they, on your primary and secondaries? Any replication error like this is expected to cause a hard failure, since there is no automatic way to recover; replication just stopping or looping is too easily ignored and causes many other issues. Do you dynamically create background indexes, possibly the same one multiple times?
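A sketch that answers the first question mechanically, assuming pymongo (the host is a placeholder); it reports every collection whose index count is approaching the 64-index cap:

```python
# Sketch: report index counts per collection, flagging any near the 64 cap.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example:27017")  # placeholder host
for db_name in client.list_database_names():
    db = client[db_name]
    for coll_name in db.list_collection_names():
        count = len(db[coll_name].index_information())
        if count >= 60:  # within striking distance of the hard limit
            print("%s.%s: %d indexes" % (db_name, coll_name, count))
```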