[SERVER-8012] multiple nodes in replica set crashed at the same time (E11000 duplicate key error index) Created: 21/Dec/12 Updated: 10/Dec/14 Resolved: 19/Aug/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Roy Smith | Assignee: | Andre de Frere |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
$ cat /etc/lsb-release
$ uname -a |
||
| Attachments: |
|
| Operating System: | Linux |
| Steps To Reproduce: | I certainly hope not |
| Participants: |
| Description |
|
We're running a 5-node replica set (including 1 hidden member). This morning, 3 nodes (including the master) all crashed at the same time! The log file for the primary contains:

Fri Dec 21 15:45:45 [repl writer worker 4] ERROR: writer worker caught exception: E11000 duplicate key error index: songza.song_bans.$station_id_1_song_id_1 dup key: { : 1382507, : 10983572 } on: { ts: Timestamp 1356103732000|9, h: -4292312148168796626, v: 2, op: "i", ns: "songza.song_bans", o: { _id: ObjectId('50d4803398896c3d577f54c4'), ban_time: new Date(1356103731894), song_id: 10983572, user_id: 1079677, station_id: 1382507 } }
***aborting after fassert() failure
Fri Dec 21 15:45:45 Got signal: 6 (Aborted).
Fri Dec 21 15:45:45 Backtrace:

Attempts to restart or repair this node have not been successful. We were able to restart the two crashed secondaries, and one of them eventually stepped up to become the primary. I'm attaching log files from the three nodes in question. db1a is the hidden member. db1c was the primary (and is the one which will not come back up). db1d was (and continues to be) one of the secondaries. |
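For reference, the document that already holds the conflicting key pair can be looked up from the mongo shell on a surviving member. A minimal sketch, using only the key values and ObjectId taken from the error message above:

// Look up the existing document that owns the (station_id, song_id) pair,
// and the document the failed oplog entry was trying to insert.
db.song_bans.find({ station_id: 1382507, song_id: 10983572 })
db.song_bans.find({ _id: ObjectId('50d4803398896c3d577f54c4') })

If the first query returns a document whose _id differs from the one in the oplog entry, the replicated insert genuinely conflicts with data already present on that member.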
| Comments |
| Comment by Andre de Frere [ 17/Mar/13 ] |
|
Hi Roy,

Given that, I'll set this ticket to Closed and we can continue working in the new ticket.

Regards, |
| Comment by Roy Smith [ 17/Mar/13 ] |
|
Hmmm, just noticed this is waiting for input from me.

We were able to track down and delete the duplicate data, and upgraded to 2.2.2 on all our servers. Everything was running fine for a couple of months, but we just had another crash this evening which looks like a similar problem. See the new ticket. |
| Comment by Andre de Frere [ 13/Mar/13 ] |
|
Hi Roy, Checking in with you on this subject. Do you have any responses to the previously asked questions? |
| Comment by Andre de Frere [ 06/Feb/13 ] |
|
Hi Roy,

There could be a few things happening here. First, the backup you restored could have had the corruption already in the oplog. While the 2.2.2 release should prevent further instances of the corruption from appearing, it will not be able to repair corruption already present in the oplog. Second, it is possible that there really are duplicate key values in the data behind the index you are trying to insert into. Have you checked the data for duplicate key values?

Regards, |
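One way to run such a check, sketched with the 2.2-era aggregation framework and the field names shown in the error message (an illustration only, not a supported procedure):

// Group on the unique key and keep the _id of every document in each group.
db.song_bans.aggregate([
    { $group: { _id: { station_id: "$station_id", song_id: "$song_id" },
                count: { $sum: 1 },
                ids: { $push: "$_id" } } },
    // Keep only key pairs that occur more than once.
    { $match: { count: { $gt: 1 } } }
])

Any group returned identifies a (station_id, song_id) pair with more than one document; the extra documents would have to be removed before the unique index can be built.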
| Comment by Roy Smith [ 08/Jan/13 ] |
|
I don't think this problem has been solved by 2.2.2. We've installed 2.2.2 everywhere and have not had any further crashes in production, but I just restored a dump from production into a development instance and got a similar crash. The command I ran was:

$ mongorestore mongo-db1a.songza.com-2013-01-06 --oplogReplay

The last few lines of output are:

Tue Jan 8 11:55:53 Creating index: { key: { _id: 1 }, ns: "songza.situation_user_data", name: "_id_" }
Creating index: { key: { user_id: 1 }, ns: "songza.situation_user_data", name: "user_id_1" }
Creating index: { key: { _id: 1 }, ns: "songza.song_bans", name: "_id_" }
Creating index: { key: { station_id: 1, song_id: 1 }, unique: true, ns: "songza.song_bans", name: "station_id_1_song_id_1", background: true }

This is the same collection (song_bans) and the same dup key ({ : 1382507, : 10983572 }) on which mongo crashed earlier. There is apparently some piece of bad data in this collection which is not being handled properly, and 2.2.2 didn't fix that. |
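Once any duplicates have been removed from the restored copy, the failing index build from the dump can be retried by hand to confirm the fix. A sketch with the 2.2-era shell helper, using the same key spec and options shown in the log above:

// Retry the unique index build on the restored dev copy (sketch only);
// it will fail with E11000 again if any duplicate (station_id, song_id) pair remains.
db.song_bans.ensureIndex({ station_id: 1, song_id: 1 }, { unique: true, background: true })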
| Comment by Roy Smith [ 27/Dec/12 ] |
|
OK, we'll try upgrading to 2.2.2. We have not made any attempts to replicate the issue. We have already recovered the secondary by resyncing. |
| Comment by Andre de Frere [ 27/Dec/12 ] |
|
Hi Roy,

It appears you have an even number of voting nodes in your replica set. This is not a recommended configuration, as discussed in the documentation. The stack trace that appears in the logs and in the description of this ticket could be caused by the idempotency issues that were resolved in 2.2.2. You can see more discussion on this subject at |
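The voting weight of the set can be checked from the shell. A minimal sketch (members without an explicit "votes" field count as 1 vote by default):

// Sum each member's votes to see whether the set has an even number of votes.
var total = 0;
rs.conf().members.forEach(function (m) {
    var v = (m.votes === undefined) ? 1 : m.votes;
    print(m.host + "  votes: " + v);
    total += v;
});
print("total votes: " + total);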
| Comment by Roy Smith [ 21/Dec/12 ] |
|
It turns out I described the situation incorrectly. I originally thought db1c was the primary. That's not correct; it was db1e. Apparently, when multiple secondaries crashed, db1e demoted itself to a secondary:

roy@dev1:~$ mongo db1e.songza.com

and then came back to being the primary once the other two secondaries came back up. We also have an arbiter in the mix:

songza-new:PRIMARY> rs.conf()
[member entries truncated; the last entry is the arbiter]
    {
        "_id" : 9,
        "host" : "db-arbiter.songza.com:27017",
        "arbiterOnly" : true
    }
] |
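For completeness, the member states after a failover like the one described above can be checked from any node. A minimal sketch:

// Print each member's replica set state (PRIMARY, SECONDARY, ARBITER, ...)
// and the host currently reported as primary.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr);
});
print("primary: " + db.isMaster().primary);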