[SERVER-8012] multiple nodes in replica set crashed at the same time (E11000 duplicate key error index) Created: 21/Dec/12  Updated: 10/Dec/14  Resolved: 19/Aug/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Roy Smith Assignee: Andre de Frere
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.4 LTS"

$ uname -a
Linux db1a.songza.com 2.6.32-347-ec2 #52-Ubuntu SMP Fri Jul 27 14:38:36 UTC 2012 x86_64 GNU/Linux


Attachments: Text File db1a-mongodb.log     Text File db1c-mongodb.log     Text File db1d-mongodb.log    
Operating System: Linux
Steps To Reproduce:

I certainly hope not

Participants:

 Description   

We're running a 5-node replica set (including 1 hidden member). This morning, 3 nodes (including the master) all crashed at the same time!

The log file for the primary contains:

Fri Dec 21 15:45:45 [repl writer worker 4] ERROR: writer worker caught exception: E11000 duplicate key error index: songza.song_bans.$station_id_1_song_id_1 dup key: { : 1382507, : 10983572 } on: { ts: Timestamp 1356103732000|9, h: -4292312148168796626, v: 2, op: "i", ns: "songza.song_bans", o: { _id: ObjectId('50d4803398896c3d577f54c4'), ban_time: new Date(1356103731894), song_id: 10983572, user_id: 1079677, station_id: 1382507 } }
Fri Dec 21 15:45:45 [repl writer worker 4] Fatal Assertion 16360
0xaf8c41 0xabe223 0x99b83e 0xacc4cd 0xb3ec79 0x7f671505e9ca 0x7f6714405cdd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaf8c41]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xabe223]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x13e) [0x99b83e]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xacc4cd]
/usr/bin/mongod() [0xb3ec79]
/lib/libpthread.so.0(+0x69ca) [0x7f671505e9ca]
/lib/libc.so.6(clone+0x6d) [0x7f6714405cdd]
Fri Dec 21 15:45:45 [repl writer worker 4]

***aborting after fassert() failure

Fri Dec 21 15:45:45 Got signal: 6 (Aborted).

Fri Dec 21 15:45:45 Backtrace:
0xaf8c41 0x5586c9 0x7f6714352af0 0x7f6714352a75 0x7f67143565c0 0xabe25e 0x99b83e 0xacc4cd 0xb3ec79 0x7f671505e9ca 0x7f6714405cdd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaf8c41]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5586c9]
/lib/libc.so.6(+0x33af0) [0x7f6714352af0]
/lib/libc.so.6(gsignal+0x35) [0x7f6714352a75]
/lib/libc.so.6(abort+0x180) [0x7f67143565c0]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xabe25e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x13e) [0x99b83e]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xacc4cd]
/usr/bin/mongod() [0xb3ec79]
/lib/libpthread.so.0(+0x69ca) [0x7f671505e9ca]
/lib/libc.so.6(clone+0x6d) [0x7f6714405cdd]

Attempts to restart or repair this node have not been successful. We were able to restart the two crashed secondaries, and one of them eventually stepped up to become the primary.

I'm attaching log files from the three nodes in question. db1a is the hidden member. db1c was the primary (and is the one which will not come back up). db1d was (and continues to be) one of the secondaries.
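For reference, the documents behind the reported key can be inspected directly from the mongo shell; a minimal sketch, assuming the two dup key values map to station_id and song_id in the order implied by the index name (station_id_1_song_id_1):

songza-new:PRIMARY> use songza
songza-new:PRIMARY> // documents sharing the duplicate key reported in the log above
songza-new:PRIMARY> db.song_bans.find({ station_id: 1382507, song_id: 10983572 })
songza-new:PRIMARY> db.song_bans.find({ station_id: 1382507, song_id: 10983572 }).count()

(On a secondary, rs.slaveOk() is needed before these reads are allowed.)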



 Comments   
Comment by Andre de Frere [ 17/Mar/13 ]

Hi Roy,

Given that, I'll close this ticket and we can continue working in SERVER-9004.

Regards,
André

Comment by Roy Smith [ 17/Mar/13 ]

Hmmm, just noticed this is waiting for input from me. We were able to track down and delete the duplicate data, and upgraded to 2.2.2 on all our servers. Everything was running fine for a couple of months, but we just had another crash this evening which looks like a similar problem. See SERVER-9004.

Comment by Andre de Frere [ 13/Mar/13 ]

Hi Roy,

Checking in with you on this subject. Do you have any responses to the previously asked questions?

Comment by Andre de Frere [ 06/Feb/13 ]

Hi Roy,

There could be a few things happening here. First, the backup you restored could have had the corruption already in the oplog. While the 2.2.2 release should prevent further instances of the corruption from appearing, it will not be able to repair corruption already present in the oplog. Secondly, it is possible that there are actually duplicate key values in the data you are inserting into the unique index. Have you checked the data for duplicate key values?
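For example, a minimal sketch of such a check using the 2.2 aggregation framework, grouping on the unique index's fields (station_id, song_id) taken from the index named in the error; the stages are standard aggregation operators and nothing here is specific to your deployment:

db.song_bans.aggregate([
    // group by the unique key and count how many documents share each pair
    { $group: { _id: { station_id: "$station_id", song_id: "$song_id" },
                n: { $sum: 1 },
                ids: { $push: "$_id" } } },
    // keep only key pairs that occur more than once
    { $match: { n: { $gt: 1 } } }
])

In 2.2 the result comes back inline as a single document ({ result: [ ... ], ok: 1 }), so this sketch only works when the number of duplicates is modest; the ids array lists the _id values of the conflicting documents.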

Regards,
André

Comment by Roy Smith [ 08/Jan/13 ]

I don't think this problem has been solved by 2.2.2.

We've installed 2.2.2 everywhere. We have not had any further crashes in production, but I just restored a dump from production into a development instance and got a similar crash. The command I ran was:

$ mongorestore mongo-db1a.songza.com-2013-01-06 --oplogReplay

The last few lines of output are:

Tue Jan 8 11:55:53 Creating index: { key: { _id: 1 }, ns: "songza.situation_user_data", name: "_id_" }
Tue Jan 8 11:55:54 Creating index: { key: { user_id: 1 }, ns: "songza.situation_user_data", name: "user_id_1" }
Tue Jan 8 11:58:28 mongo-db1a.songza.com-2013-01-06/songza/song_bans.bson
Tue Jan 8 11:58:28 going into namespace [songza.song_bans]
Tue Jan 8 11:58:31 10635400/37529218 28% (bytes)
Tue Jan 8 11:58:34 17416800/37529218 46% (bytes)
Tue Jan 8 11:58:37 24190360/37529218 64% (bytes)
Tue Jan 8 11:58:40 30520760/37529218 81% (bytes)
Tue Jan 8 11:58:43 37228360/37529218 99% (bytes)
457669 objects found
Tue Jan 8 11:58:43 Creating index: { key: { _id: 1 }, ns: "songza.song_bans", name: "_id_" }
Tue Jan 8 11:58:48 Creating index: { key: { station_id: 1, song_id: 1 }, unique: true, ns: "songza.song_bans", name: "station_id_1_song_id_1", background: true }
Tue Jan 8 11:58:53 ERROR: Error creating index songza.song_bans: 11000 err: "E11000 duplicate key error index: songza.song_bans.$station_id_1_song_id_1 dup key: { : 1382507, : 10983572 }"
Aborted (core dumped)
roy@dev3:/mnt/songza$

This is the same collection (song_bans) and the same error (dup key: { : 1382507, : 10983572 }) on which mongo crashed earlier. There is apparently some piece of bad data in this collection which is not being handled properly, and 2.2.2 didn't fix that.

Comment by Roy Smith [ 27/Dec/12 ]

OK, we'll try upgrading to 2.2.2.

We have not made any attempts to replicate the issue.

We have already recovered the secondary by resyncing.

Comment by Andre de Frere [ 27/Dec/12 ]

Hi Roy,

It appears you have an even number of voting nodes in your replica set. This is not a recommended strategy, as discussed in the documentation.

The stack trace that appears in the logs and in the description of this ticket could be caused by the idempotency issues that were resolved in 2.2.2. You can see more discussion on this subject at SERVER-7551, SERVER-7186 and SERVER-6671. Version 2.2.2 should prevent any new idempotency issues in the oplog, but you will still need to manually recover the failed secondaries (by fully resyncing, or by doing a file system copy of the db path). Do you have a test environment where you can try 2.2.2? Can you cause this problem to happen again at will?

Comment by Roy Smith [ 21/Dec/12 ]

It turns out I described the situation incorrectly. I originally thought db1c was the primary. That's not correct. It was db1e. Apparently, when multiple secondaries crashed, db1e demoted itself to a secondary:

roy@dev1:~$ mongo db1e.songza.com
MongoDB shell version: 2.2.1
connecting to: db1e.songza.com/test
songza-new:SECONDARY>

and then came back to being the primary once the other two secondaries came back up. We also have an arbiter in the mix:

songza-new:PRIMARY> rs.conf()
{
    "_id" : "songza-new",
    "version" : 70,
    "members" : [
        {
            "_id" : 3,
            "host" : "db1a.songza.com:27017",
            "priority" : 0,
            "hidden" : true,
            "tags" : { "role" : "reporting" }
        },
        {
            "_id" : 4,
            "host" : "db1b.songza.com:27017",
            "tags" : { "feed_test" : "A", "role" : "production" }
        },
        {
            "_id" : 6,
            "host" : "db1e.songza.com:27017",
            "priority" : 5,
            "tags" : { "role" : "production" }
        },
        {
            "_id" : 7,
            "host" : "db1c.songza.com:27017",
            "tags" : { "feed_test" : "B", "role" : "production" }
        },
        {
            "_id" : 8,
            "host" : "db1d.songza.com:27017",
            "priority" : 0,
            "hidden" : true,
            "tags" : { "role" : "production" }
        },
        {
            "_id" : 9,
            "host" : "db-arbiter.songza.com:27017",
            "arbiterOnly" : true
        }
    ]
}
