[SERVER-9904] Secondaries crash when too many indexes Created: 11/Jun/13  Updated: 10/Dec/14  Resolved: 26/Jun/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: charity majors Assignee: Scott Hernandez (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-9856 No check for building identical backg... Closed
Operating System: ALL
Participants:

 Description   

I upgraded the primary and one secondary on my four-node + two-arbiter cluster (one node nonvoting) yesterday, from 2.2.3 to 2.4.4.

This morning all of my secondaries crashed with this error:

Tue Jun 11 14:26:20 [repl writer worker 15] ERROR: add index fails, too many indexes for appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides key:{ toyStory: 1 }
writer worker caught exception: add index fails, too many indexes for appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides key:{ toyStory: 1 } on: { ts: Timestamp 1370960780000|22, h: 2282539247784756123, v: 2, op: "i", ns: "appdata49.system.indexes", o: { name: "toyStory_1", ns: "appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides", key: { toyStory: 1 }, background: true } }
Tue Jun 11 14:26:20 [repl writer worker 15]   Fatal Assertion 16360
0xb07561 0xacc8b3 0x9abaf6 0xadab5d 0xb4d3d9 0x7fa1bc2cee9a 0x7fa1bb5e1cbd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xb07561]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xacc8b3]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9abaf6]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xadab5d]
 /usr/bin/mongod() [0xb4d3d9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fa1bc2cee9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa1bb5e1cbd]
Tue Jun 11 14:26:20 [repl writer worker 15]
 
***aborting after fassert() failure
 
 
Tue Jun 11 14:26:20 Got signal: 6 (Aborted).
 
Tue Jun 11 14:26:20 Backtrace:
0xb07561 0x5598c9 0x7fa1bb5244a0 0x7fa1bb524425 0x7fa1bb527b8b 0xacc8ee 0x9abaf6 0xadab5d 0xb4d3d9 0x7fa1bc2cee9a 0x7fa1bb5e1cbd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xb07561]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5598c9]
 /lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fa1bb5244a0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fa1bb524425]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fa1bb527b8b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xacc8ee]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9abaf6]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xadab5d]
 /usr/bin/mongod() [0xb4d3d9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fa1bc2cee9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fa1bb5e1cbd]

We've seen the "add index fails, too many indexes" error before many times on our 2.2 primaries, and it's never caused a problem. Is this a bug that was introduced in 2.4?

We dynamically generate indexes, so this will make it effectively impossible for us to upgrade to 2.4 until it's fixed.



 Comments   
Comment by Daniel Pasette (Inactive) [ 26/Jun/13 ]

marking as duplicate of SERVER-9856

Comment by charity majors [ 13/Jun/13 ]

Ah!! Fantastic. Thank you so much.

Comment by Scott Hernandez (Inactive) [ 13/Jun/13 ]

I believe the underlying bug is related to creating multiple background indexes at a time, which has been fixed and will be in the 2.4.5 release: SERVER-9856

If you can make sure you don't create the same (background) indexes concurrently then you will be able to avoid this problem in the current 2.4 releases.
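For illustration, a minimal shell-side guard along these lines (the collection name and key are taken from this ticket; the guard itself is an assumption about how an application might avoid re-issuing the same background ensureIndex) skips the call when an index on that key already exists:

 // Sketch only: avoid issuing a duplicate background ensureIndex.
 var coll = db.getSiblingDB("appdata49")
              .getCollection("app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides");
 var wanted = { toyStory: 1 };
 var alreadyThere = coll.getIndexes().some(function (idx) {
     return JSON.stringify(idx.key) === JSON.stringify(wanted);  // ignores key-order subtleties
 });
 if (!alreadyThere) {
     coll.ensureIndex(wanted, { background: true });
 }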

Comment by charity majors [ 13/Jun/13 ]

I see what you're saying. Yes, after scanning all my nodes, the secondaries that are reporting "too many indexes" in the logs are all former primaries.

So looking at the collection that generated this error, I see 64 indexes on the primary:

 mongodata3:PRIMARY> db["app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides"].getIndexes().length
64

And only 47 indexes on the secondary.

 > db["app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides"].getIndexes().length
47

That makes even less sense. Strange.
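For what it's worth, a quick way to see exactly which index definitions differ (a sketch; the host names are placeholders) is to compare the index names from both members in the shell:

 var nsColl = "app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides";
 var primary = connect("primary-host:27017/appdata49");      // placeholder hosts
 var secondary = connect("secondary-host:27017/appdata49");
 secondary.getMongo().setSlaveOk();                           // allow reads from the secondary
 var pNames = primary.getCollection(nsColl).getIndexes().map(function (i) { return i.name; });
 var sNames = secondary.getCollection(nsColl).getIndexes().map(function (i) { return i.name; });
 printjson(pNames.filter(function (n) { return sNames.indexOf(n) < 0; }));  // only on the primary
 printjson(sNames.filter(function (n) { return pNames.indexOf(n) < 0; }));  // only on the secondary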

The index that errored on the secondary:

Tue Jun 11 14:26:22.045 [repl writer worker 13] add index fails, too many indexes for appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides key:{ toyStory: 1 }
Tue Jun 11 14:26:22.083 [repl writer worker 13] ERROR: writer worker caught exception: add index fails, too many indexes for appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides key:{ toyStory: 1 } on: { ts: Timestamp 1370960780000|22, h: 2282539247784756123, v: 2, op: "i", ns: "appdata49.system.indexes", o: { name: "toyStory_1", ns: "appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides", key: { toyStory: 1 }, background: true } }
Tue Jun 11 14:26:22.083 [repl writer worker 13]   Fatal Assertion 16360
0xdd2331 0xd92323 0xc231db 0xd9fe71 0xe1aad9 0x7fe55552ce9a 0x7fe55483fcbd
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdd2331]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd92323]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x13b) [0xc231db]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9fe71]
 /usr/bin/mongod() [0xe1aad9]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fe55552ce9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fe55483fcbd]
Tue Jun 11 14:26:22.106 [repl writer worker 13]
 
***aborting after fassert() failure
 
The index definitely exists on the primary:
 
	{
		"v" : 1,
		"key" : {
			"toyStory" : 1
		},
		"ns" : "appdata49.app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides",
		"name" : "toyStory_1",
		"background" : true
	},
 

We don't enforce that only one ensureIndex can be running at the same time. According to the mongo docs for 2.2, only one background index per database can be running at the same time, so we let mongo enforce that. (I think that was relaxed to one index per collection in 2.4?)
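One way an application could approximate that guard itself (a sketch only; it assumes in-progress index builds show up in db.currentOp() with a "msg" progress field mentioning the build, and the exact wording varies by version, so treat the regex as an approximation rather than a guaranteed API) is to check for a running build before issuing another background ensureIndex:

 // Sketch only: defer the ensureIndex if any index build appears to be running.
 var building = db.currentOp().inprog.filter(function (op) {
     return op.msg && /index/i.test(op.msg);
 });
 if (building.length === 0) {
     db.getCollection("app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides")
       .ensureIndex({ toyStory: 1 }, { background: true });
 } else {
     print("an index build is already in progress; deferring ensureIndex");
 }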

Comment by Scott Hernandez (Inactive) [ 12/Jun/13 ]

I created a test with a replica set where I manually created one more index on each replica (not the primary) and then created the 64th index on the primary. In all cases the replicas shut down, as you observed, because they couldn't create an additional index. I used 2.2.3 and 2.4.4 primaries with both versions as secondaries, and in all configurations it behaved as expected.

Please upload the full logs, including the startup stanza, from any nodes which have not shut down after hitting a replication error creating an index when they were already at the limit.

Also, please note that the logs you provided in your last message are from the primary, not from a replica (via replication):

Thu May 23 06:45:25 [conn16856606] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key:{ email: 1 }

This also agrees with your statement earlier that:

We've seen the "add index fails, too many indexes" error before many times on our 2.2 primaries, and it's never caused a problem

And, as such, those errors were not replicated.

Anything with a [connXXXXXX] prefix after the date in the logs is a user connection and not replication-related; replication entries start like this: [repl writer worker XX]

Comment by Scott Hernandez (Inactive) [ 12/Jun/13 ]

The primary and secondary should have the same indexes, so there should not be a reason why the secondary gets an index-creation oplog entry that causes it to have too many indexes; the primary would have reached that state first and would have failed to create (and replicate) the new index. Now, in the case where a replica gets a replicated command to create an index that it is not able to create (because it would result in too many indexes), it will shut down with a stack trace, as you see in the logs. This is not a "crash" but an fassert (fatal assertion). This is an unrecoverable (replication) error – at least without human/manual intervention – but you can remove the extra indexes which caused replication to fail and the server to shut down. I'm not suggesting that this is a normal condition or that having replicas shut down is a good thing, but in the case of an unrecoverable error this is the "correct" thing, and expected.
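For example, one possible manual recovery, sketched with the names from this ticket (assuming the conflicting indexes live on the member that shut down, started temporarily without --replSet so local writes are allowed):

 // Sketch only: run against the member holding the extra indexes.
 var coll = db.getSiblingDB("appdata49")
              .getCollection("app_b054cda3-7bd2-4f42-8bd9-31ca03987684:DisneyRides");
 coll.getIndexes().length          // confirm the collection is at the 64-index limit
 coll.dropIndex("extraIndex_1")    // "extraIndex_1" is hypothetical: pick an index the primary does not have
 // restart the member with --replSet afterwards so replication can re-apply the failed oplog entry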

Now, if you can provide the indexes on the primary and on one of these shut-down replicas, we can see what is going on. The real question is how the primary can have different and fewer indexes than your other replicas.

Also, how do you make sure that you don't issue more than one ensureIndex command at the same time for the same index?

Comment by charity majors [ 12/Jun/13 ]

Wait, are you saying this is supposed to crash?

We have seen thousands of these errors, and they have never crashed a secondary before.

Thu May 23 06:45:25 [conn16856606] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key:{ email: 1 }
Thu May 23 06:47:45 [conn16856719] add index fails, too many indexes for appdata18.app_f1e23882-768d-4652-84c5-e1fe0c9edb2e:Activity key:{ _updated_at: 1 }
end connection 10.38.86.217:47168 (5429 connections now open)
Thu May 23 06:47:45 [conn16856720] add index fails, too many indexes for appdata18.app_f1e23882-768d-4652-84c5-e1fe0c9edb2e:Activity key:{ _updated_at: 1 }
Thu May 23 06:49:20 [conn16856807] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key:{ fbuid: 1 }
end connection 10.169.17.85:44117 (5429 connections now open)
Thu May 23 06:49:20 [conn16856808] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key:{ fbuid: 1 }
end connection 10.169.17.85:44118 (5428 connections now open)
Thu May 23 06:49:20 [initandlisten] connection accepted from 10.38.86.217:47185 #16856809 (5429 connections now open)
Thu May 23 06:49:20 [conn16856809] add index fails, too many indexes for appdata5.app_774d8859-17e8-4929-9e68-403d64f6d5f4:UserWrapper key:{ email: 1 }

It's always been handled correctly when the oplog came from a 2.2 primary. It's only now that we have a 2.4 primary that it is crashing all our secondaries, both 2.2 and 2.4 secondaries.

Yes, we do dynamically create background indexes, but not the same one multiple times.

Comment by Scott Hernandez (Inactive) [ 11/Jun/13 ]

How many indexes do you have and what are they on your primary and secondaries?

Any replication error like this is expected to cause a hard failure since there is no automatic way to recover. Replication just stopping/looping is too easily ignored and causes many other issues.

Do you dynamically create background indexes, possibly the same one multiple times?
