[SERVER-13514] ERROR: writer worker caught exception: E11000 duplicate key error index: Created: 08/Apr/14  Updated: 10/Dec/14  Resolved: 12/Apr/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.9, 2.4.10
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Dev Ops Assignee: Thomas Rueckstiess
Resolution: Done Votes: 0
Labels: replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

EC2
Linux mongo-005.foo.com 3.2.0-60-virtual #91-Ubuntu SMP Wed Feb 19 04:13:28 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux


Operating System: Linux
Participants:

 Description   

Our replication slaves are crashing and we need monit in order to start mongo and keep our servers running.

we have 3 nodes and one arbiter in our replication set.

blah:SECONDARY> rs.status()
{
        "set" : "blah",
        "date" : ISODate("2014-04-08T17:44:54Z"),
        "myState" : 2,
        "syncingTo" : "ec2-54-217-204-132.eu-west-1.compute.amazonaws.com:27017",
        "members" : [
                {
                        "_id" : 9,
                        "name" : "ec2.compute.amazonaws.com:27017",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 4418,
                        "lastHeartbeat" : ISODate("2014-04-08T17:44:52Z"),
                        "lastHeartbeatRecv" : ISODate("2014-04-08T17:44:53Z"),
                        "pingMs" : 0
                },
                {
                        "_id" : 10,
                        "name" : "ec2.compute.amazonaws.com:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 4416,
                        "optime" : Timestamp(1396979093, 117),
                        "optimeDate" : ISODate("2014-04-08T17:44:53Z"),
                        "lastHeartbeat" : ISODate("2014-04-08T17:44:53Z"),
                        "lastHeartbeatRecv" : ISODate("2014-04-08T17:44:53Z"),
                        "pingMs" : 1
                },
                {
                        "_id" : 11,
                        "name" : "ec2.compute.amazonaws.com:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 4427,
                        "optime" : Timestamp(1396979094, 42),
                        "optimeDate" : ISODate("2014-04-08T17:44:54Z"),
                        "self" : true
                },
                {
                        "_id" : 12,
                        "name" : "ec2.compute.amazonaws.com:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 4354,
                        "optime" : Timestamp(1396979094, 30),
                        "optimeDate" : ISODate("2014-04-08T17:44:54Z"),
                        "lastHeartbeat" : ISODate("2014-04-08T17:44:54Z"),
                        "lastHeartbeatRecv" : ISODate("2014-04-08T17:44:53Z"),
                        "pingMs" : 0,
                        "syncingTo" : "ec2.eu-west-1.compute.amazonaws.com:27017"
                }
        ],
        "ok" : 1
}
 

Tue Apr  8 15:41:02.882 [repl writer worker 2] ERROR: writer worker caught exception: E11000 duplicate key error index: reporting_events-production.tracking_game_campaigns.$token_1_game_id_1  dup key: { : null, : null } on: { ts: Timestamp 1396971662000|179, h: -5379447976443473461, v: 2, op: "u", ns: "reporting_events-production.tracking_game_campaigns", o2: { _id: ObjectId('531e6a0c365ad5221f00057a') }, o: { $set: { campaign_id: 1214255, expires_at: new Date(1397058062870), updated_at: new Date(1396971662875) } } }
 
Tue Apr  8 15:41:37.327 [repl writer worker 3] ERROR: writer worker caught exception: E11000 duplicate key error index: reporting_events-production.tracking_game_campaigns.$token_1_game_id_1  dup key: { : null, : null } on: { ts: Timestamp 1396971662000|179, h: -5379447976443473461, v: 2, op: "u", ns: "reporting_events-production.tracking_game_campaigns", o2: { _id: ObjectId('531e6a0c365ad5221f00057a') }, o: { $set: { campaign_id: 1214255, expires_at: new Date(1397058062870), updated_at: new Date(1396971662875) } } }
 
Tue Apr  8 15:42:13.895 [repl writer worker 4] ERROR: writer worker caught exception: E11000 duplicate key error index: reporting_events-production.tracking_game_campaigns.$token_1_game_id_1  dup key: { : null, : null } on: { ts: Timestamp 1396971662000|179, h: -5379447976443473461, v: 2, op: "u", ns: "reporting_events-production.tracking_game_campaigns", o2: { _id: ObjectId('531e6a0c365ad5221f00057a') }, o: { $set: { campaign_id: 1214255, expires_at: new Date(1397058062870), updated_at: new Date(1396971662875) } } }
 
Tue Apr  8 16:30:54.943 [repl writer worker 5] ERROR: writer worker caught exception: E11000 duplicate key error index: reporting_events-production.tracking_game_campaigns.$token_1_game_id_1  dup key: { : null, : null } on: { ts: Timestamp 1396974654000|125, h: -6524155580285384310, v: 2, op: "u", ns: "reporting_events-production.tracking_game_campaigns", o2: { _id: ObjectId('531e6b7dc1be4444e2001b58') }, o: { $set: { campaign_id: 1165644, updated_at: new Date(1396974654930) } } }



 Comments   
Comment by Thomas Rueckstiess [ 12/Apr/14 ]

You're welcome. I'll close this ticket now as "works as designed" as I didn't see any evidence of a bug in MongoDB.

Regards,
Thomas

Comment by Dev Ops [ 11/Apr/14 ]

we are now without mongo crashes for 1d and 8 hours. we cleared the problematic entries and everything is fine now.
thank you for the help

Comment by Thomas Rueckstiess [ 10/Apr/14 ]

From your outputs above it looks like the primary has the correct data (consistent with the unique index), and the secondaries hold documents that shouldn't exist or are missing some fields.

I'd advise to be careful and take backups of your nodes before resyncing, as well as to audit your data on the primary to find out if it contains a complete and consistent dataset.

Without any further information my best guess at this stage is that this was somehow caused during the data migration you mentioned. We'd be interested to find out how your secondaries ended up with documents violating the uniqueness constraint, but we'd need the log files as a first step to continue the investigation.

Thanks,
Thomas

Comment by Dev Ops [ 10/Apr/14 ]

Thank you! We will think about a complete resync of our secondaries.
Can you explain why the Primary isn't the last instance when it comes diverged data?

Comment by Thomas Rueckstiess [ 10/Apr/14 ]

Hi,

As you can see, the data on the primary and secondaries have diverged. The primary does not contain any data where token and game_id are both null, but the secondaries do. To restore data consistency you will have to resync your secondary nodes completely from scratch. The process is explained on our page Resync a Member of a Replica Set.

This could have been caused during the data migration, but I don't have enough information to be certain. The log files might provide some insight, if you want I can have a look at them. You can upload them securely to our servers without making them public with these steps:

Please compress your files and upload them to our site. You can do this using the following command:

scp -P 722 -r <filename> SERVER-13514@www.mongodb.com:

Substitute <filename> with the name of the file(s) you want to upload. Note the colon ':' at the end of the command. When you are prompted for a password, please just press ENTER.

Regards,
Thomas

Comment by Dev Ops [ 10/Apr/14 ]

Hi Thomas!

Here are the requested command outputs:

blah:PRIMARY> use reporting_events-production
switched to db reporting_events-production
blah:PRIMARY> db.tracking_game_campaigns.find({_id: ObjectId('531e6a0c365ad5221f00057a')})
blah:PRIMARY> db.tracking_game_campaigns.find({token: null, game_id: null})
blah:PRIMARY>

blah:SECONDARY> use reporting_events-production
switched to db reporting_events-production
blah:SECONDARY> db.tracking_game_campaigns.find({_id: ObjectId('531e6a0c365ad5221f00057a')})
blah:SECONDARY> db.tracking_game_campaigns.find({token: null, game_id: null})
{ "_id" : ObjectId("531e6b19c1be4444e2001a63"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T13:23:16.838Z") }
{ "_id" : ObjectId("531e695ec1be44eb2800228c"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T17:57:51.259Z") }
{ "_id" : ObjectId("531e693786cd346cae00063d"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T20:35:46.964Z") }
{ "_id" : ObjectId("531e68bfa643660d000003dc"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-24T20:56:23.847Z") }
{ "_id" : ObjectId("531e694c4c4fcd42910006f2"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-21T20:20:50.198Z") }
{ "_id" : ObjectId("531e69db9dd5740350000b1b"), "campaign_id" : 1202142, "updated_at" : ISODate("2014-03-21T20:21:44.982Z") }
{ "_id" : ObjectId("531e689586cd34d8a4000830"), "campaign_id" : 1173177, "updated_at" : ISODate("2014-04-01T19:28:31.653Z") }
{ "_id" : ObjectId("531e696a365ad55bd400090d"), "campaign_id" : 1201980, "updated_at" : ISODate("2014-03-21T22:05:05.381Z") }
{ "_id" : ObjectId("531e69e7365ad5221f000501"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-21T22:06:23.633Z") }
{ "_id" : ObjectId("531e68bec1be4444e2001980"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-27T13:07:55.755Z") }
{ "_id" : ObjectId("531e6b0da64366e9b2000234"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-27T13:09:18.735Z") }
{ "_id" : ObjectId("531e6aca2fc5cbd1320005a5"), "campaign_id" : 1202005, "updated_at" : ISODate("2014-04-08T01:24:12.704Z") }
{ "_id" : ObjectId("531e694a365ad55bd40008cf"), "campaign_id" : 1202713, "updated_at" : ISODate("2014-04-08T13:52:46.822Z") }
{ "_id" : ObjectId("531e68cbc1be4444e20019c7"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-24T02:28:56.514Z") }
{ "_id" : ObjectId("531e695aa6436643d8000342"), "campaign_id" : 1202142, "updated_at" : ISODate("2014-03-24T02:49:27.830Z") }
{ "_id" : ObjectId("531e69134c4fcd5bc7000192"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-22T13:46:27.310Z") }
{ "_id" : ObjectId("531e6a0686cd346cae000835"), "campaign_id" : 1201976, "updated_at" : ISODate("2014-03-22T13:47:07.229Z") }
{ "_id" : ObjectId("531e69bb971d028be30004a3"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-25T13:51:15.836Z") }
{ "_id" : ObjectId("531e69db2fc5cb4714001a52"), "campaign_id" : 1202003, "updated_at" : ISODate("2014-03-25T13:55:15.475Z") }
{ "_id" : ObjectId("531e6a4386cd3428490009ff"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-22T00:03:35.261Z") }
Type "it" for more
blah:SECONDARY>

blah:SECONDARY> use reporting_events-production
switched to db reporting_events-production
blah:SECONDARY> db.tracking_game_campaigns.find({_id: ObjectId('531e6a0c365ad5221f00057a')})
blah:SECONDARY> db.tracking_game_campaigns.find({token: null, game_id: null})
{ "_id" : ObjectId("531e6b19c1be4444e2001a63"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T13:23:16.838Z") }
{ "_id" : ObjectId("531e695ec1be44eb2800228c"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T17:57:51.259Z") }
{ "_id" : ObjectId("531e693786cd346cae00063d"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-24T20:35:46.964Z") }
{ "_id" : ObjectId("531e68bfa643660d000003dc"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-24T20:56:23.847Z") }
{ "_id" : ObjectId("531e694c4c4fcd42910006f2"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-21T20:20:50.198Z") }
{ "_id" : ObjectId("531e69db9dd5740350000b1b"), "campaign_id" : 1202142, "updated_at" : ISODate("2014-03-21T20:21:44.982Z") }
{ "_id" : ObjectId("531e689586cd34d8a4000830"), "campaign_id" : 1173177, "updated_at" : ISODate("2014-04-01T19:28:31.653Z") }
{ "_id" : ObjectId("531e696a365ad55bd400090d"), "campaign_id" : 1201980, "updated_at" : ISODate("2014-03-21T22:05:05.381Z") }
{ "_id" : ObjectId("531e69e7365ad5221f000501"), "campaign_id" : 1201975, "updated_at" : ISODate("2014-03-21T22:06:23.633Z") }
{ "_id" : ObjectId("531e68bec1be4444e2001980"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-27T13:07:55.755Z") }
{ "_id" : ObjectId("531e6b0da64366e9b2000234"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-27T13:09:18.735Z") }
{ "_id" : ObjectId("531e6aca2fc5cbd1320005a5"), "campaign_id" : 1202005, "updated_at" : ISODate("2014-04-08T01:24:12.704Z") }
{ "_id" : ObjectId("531e694a365ad55bd40008cf"), "campaign_id" : 1202713, "updated_at" : ISODate("2014-04-08T13:52:46.822Z") }
{ "_id" : ObjectId("531e68cbc1be4444e20019c7"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-24T02:28:56.514Z") }
{ "_id" : ObjectId("531e695aa6436643d8000342"), "campaign_id" : 1202142, "updated_at" : ISODate("2014-03-24T02:49:27.830Z") }
{ "_id" : ObjectId("531e69134c4fcd5bc7000192"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-22T13:46:27.310Z") }
{ "_id" : ObjectId("531e6a0686cd346cae000835"), "campaign_id" : 1201976, "updated_at" : ISODate("2014-03-22T13:47:07.229Z") }
{ "_id" : ObjectId("531e69bb971d028be30004a3"), "campaign_id" : 1201956, "updated_at" : ISODate("2014-03-25T13:51:15.836Z") }
{ "_id" : ObjectId("531e69db2fc5cb4714001a52"), "campaign_id" : 1202003, "updated_at" : ISODate("2014-03-25T13:55:15.475Z") }
{ "_id" : ObjectId("531e6a4386cd3428490009ff"), "campaign_id" : 1202551, "updated_at" : ISODate("2014-03-22T00:03:35.261Z") }
Type "it" for more
blah:SECONDARY>

Have you manipulated data on any secondary node directly without going through the primary in the past?

No, this should be prevented by the ruby driver we use.

Was there any event you can think of that triggered these crashes? Upgrades, migration, etc.?

Yes, possible our migration. We reached the limit of 1TB Amazon EBS volumes and had to migrate to new hosts with striped EBS volumes (raid 0 mdadm devices). We recovered the whole cluster with a backup of a Secondary.

Can you attach the full log files of primary and secondaries reaching back several days if possible?

I need to check with management if I can publish our logs publicly here

Comment by Thomas Rueckstiess [ 09/Apr/14 ]

Hi,

The indexes are correct and match on all nodes.

It looks like the secondaries are trying to replicate an update that would cause a duplicate key violation on the secondaries. Have you manipulated data on any secondary node directly without going through the primary in the past? Was there any event you can think of that triggered these crashes? Upgrades, migration, etc.?

Can you attach the full log files of primary and secondaries reaching back several days if possible?

Can you also please run the following commands and paste the results in a reply:

On both the primary and the crashing secondary (or secondaries, if multiple secondaries are affected):

use reporting_events-production
db.tracking_game_campaigns.find({_id: ObjectId('531e6a0c365ad5221f00057a')})
db.tracking_game_campaigns.find({token: null, game_id: null})

This will show us the document where the update fails, and if there are any other documents with missing token/game_id. This should not be possible as the unique index prevents that. If the secondary shows inconsistent data, the best way forward is to resync the secondary from the primary.

Regards,
Thomas

Comment by Dev Ops [ 09/Apr/14 ]

the mongo secondaries crashed multiple times during the last 24 hours.
is there anything else I can provide to enable you analyzing this issue?

Comment by Dev Ops [ 08/Apr/14 ]

slave1:

blah:SECONDARY> rs.slaveOk();
blah:SECONDARY> db.tracking_game_campaigns.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "created_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "created_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "updated_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "updated_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "token" : 1,
                        "game_id" : 1
                },
                "unique" : true,
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "token_1_game_id_1"
        },
        {
                "v" : 1,
                "key" : {
                        "expires_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "sparse" : true,
                "name" : "expires_at_1"
        }
]
blah:SECONDARY>

slave2:

blah:SECONDARY> rs.slaveOk();
blah:SECONDARY> db.tracking_game_campaigns.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "created_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "created_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "updated_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "updated_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "token" : 1,
                        "game_id" : 1
                },
                "unique" : true,
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "token_1_game_id_1"
        },
        {
                "v" : 1,
                "key" : {
                        "expires_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "sparse" : true,
                "name" : "expires_at_1"
        }
]
 

Comment by Eliot Horowitz (Inactive) [ 08/Apr/14 ]

On the secondary you can do

rs.slaveOk();
db.tracking_game_campaigns.getIndexes()

Comment by Dev Ops [ 08/Apr/14 ]

master:

blah:PRIMARY> db.tracking_game_campaigns.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "created_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "created_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "updated_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "updated_at_1"
        },
        {
                "v" : 1,
                "key" : {
                        "token" : 1,
                        "game_id" : 1
                },
                "unique" : true,
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "name" : "token_1_game_id_1"
        },
        {
                "v" : 1,
                "key" : {
                        "expires_at" : 1
                },
                "ns" : "reporting_events-production.tracking_game_campaigns",
                "sparse" : true,
                "name" : "expires_at_1"
        }
]

slave:

blah:SECONDARY> db.tracking_game_campaigns.getIndexes()
Tue Apr 8 21:11:00.909 error:

{ "$err" : "not master and slaveOk=false", "code" : 13435 }

at src/mongo/shell/query.js:12

Comment by Eliot Horowitz (Inactive) [ 08/Apr/14 ]

Can you send the indexes for reporting_events-production.tracking_game_campaigns on all the nodes in the set?

Generated at Thu Feb 08 03:31:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.