[SERVER-9293] Old primary crashed after stepDown - fatal assertion 16360 Created: 09/Apr/13  Updated: 10/Dec/14  Resolved: 10/Apr/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Chris Kimpton Assignee: Jeremy Mikola
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux/Ubuntu 12.04/OpenVZ


Attachments: File mongodb_crash.log.gz     File primary.log.gz     File secondary1.log.gz     File secondary2problem.log.gz    
Operating System: ALL
Participants:

 Description   

Hi,

When I did a stepDown on the primary, it failed over to another node OK, but the old primary crashed.

When I tried to restart it, it crashed again.

Regards,
Chris



 Comments   
Comment by Chris Kimpton [ 10/Apr/13 ]

Hi Jeremy,

Sure, that's fine.

On my wish list would be better validation of the local db when joining a replica set, to avoid the issues I had - but I appreciate that I should have been more careful.

Regards,
Chris

Comment by Jeremy Mikola [ 10/Apr/13 ]

Chris,

Glad to hear it. I'm going to resolve this, as there wasn't any indication of a bug at play and the inconsistent local DB was the likely culprit.

Comment by Chris Kimpton [ 10/Apr/13 ]

Hi Jeremy,

OK - that seems to work better. I have now wiped the data directory and re-added the node to the replica set.

It's caught up, with none of the strangeness of yesterday.

Many Thanks for the help.

Regards,
Chris

Comment by Jeremy Mikola [ 10/Apr/13 ]

And yes, I did not drop the local db - what should I have done - completely emptied the /var/lib/mongodb directory?

You should wipe the data directory when creating a replica set member from scratch. The local database holds meta-information for the replica set (e.g. configuration, oplog), so leaving that in place may cause unexpected results in the new replica set.
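
For reference, a minimal sketch of the resync-from-scratch procedure (assuming the default Ubuntu package layout with the data directory at /var/lib/mongodb; exact paths and service name may differ on your nodes):

// 1. On the affected node: service mongodb stop
// 2. Remove everything under the data directory (e.g. /var/lib/mongodb/*),
//    which also removes the stale local.* files
// 3. service mongodb start - the member rejoins the set and performs an initial sync
// Then, from any other member, watch the sync progress:
rs.status()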

Comment by Chris Kimpton [ 10/Apr/13 ]

As a test of the dodgy node, I have set up a standalone mongod and a copy of the app on the problem node; I will see if that hits issues on its own.

It started OK and set up the collections with the correct indexes.

Comment by Chris Kimpton [ 10/Apr/13 ]

Hi Jeremy,

The indexes are created by the app, on the primary - I am NOT currently creating anything manually on the secondaries. Index details are listed at the bottom of this message. As the data I have is quite small (it was several gig, but the DB was practically empty when I got the last crash), I wouldn't think I need to pre-create the indexes.
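
For reference, the shell equivalents of the index builds the app issues would look roughly like this (a hedged sketch reconstructed from the index listings below; the app actually uses driver calls, and 'safe' is a driver write-concern flag that ends up stored in the index spec rather than a real index option):

use wowfeed
db.registered.ensureIndex({ name: 1, realm: 1, region: 1, type: 1 }, { unique: true })
db.registered.ensureIndex({ updated_at: 1 }, { expireAfterSeconds: 604800 })
db.armory_items.ensureIndex({ item_id: 1 }, { unique: true, dropDups: true })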

I have checked the logs across all servers for "writer worker caught exception" and I don't see any new such errors... but then it depends on the data the app processes.

Yes - I had/have a replica set of 4 members... I guess I should not be doing that.

I have added another secondary - so now I have 3 working members and the problem one. As I have had issues with this node and mongo before, I am thinking maybe there is a network issue with it - so I have raised a call with my service provider.

And yes, I did not drop the local db - what should I have done - completely emptied the /var/lib/mongodb directory?

Indexes from what was primary:

> db.registered.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "name" : 1, "realm" : 1, "region" : 1, "type" : 1 },
        "unique" : true,
        "ns" : "wowfeed.registered",
        "name" : "name_1_realm_1_region_1_type_1",
        "safe" : true
    },
    {
        "v" : 1,
        "key" : { "updated_at" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "updated_at_1",
        "expireAfterSeconds" : 604800,
        "safe" : true
    }
]

> db.armory_items.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.armory_items",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "item_id" : 1 },
        "unique" : true,
        "ns" : "wowfeed.armory_items",
        "name" : "item_id_1",
        "dropDups" : true,
        "safe" : true
    }
]

secondary, as was, but now primary

> db.registered.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "name" : 1, "realm" : 1, "region" : 1, "type" : 1 },
        "unique" : true,
        "ns" : "wowfeed.registered",
        "name" : "name_1_realm_1_region_1_type_1",
        "safe" : true
    },
    {
        "v" : 1,
        "key" : { "updated_at" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "updated_at_1",
        "expireAfterSeconds" : 604800,
        "safe" : true
    }
]

> db.armory_items.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.armory_items",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "item_id" : 1 },
        "unique" : true,
        "ns" : "wowfeed.armory_items",
        "name" : "item_id_1",
        "dropDups" : true,
        "safe" : true
    }
]

crashed secondary, now out of replset

> db.registered.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    }
]

> db.armory_items.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.armory_items",
        "name" : "_id_"
    }
]

other secondary I added later:

> db.registered.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "name" : 1, "realm" : 1, "region" : 1, "type" : 1 },
        "unique" : true,
        "ns" : "wowfeed.registered",
        "name" : "name_1_realm_1_region_1_type_1",
        "safe" : true
    },
    {
        "v" : 1,
        "key" : { "updated_at" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "updated_at_1",
        "expireAfterSeconds" : 604800,
        "safe" : true
    }
]

> db.armory_items.getIndexes();
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.armory_items",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "item_id" : 1 },
        "unique" : true,
        "ns" : "wowfeed.armory_items",
        "name" : "item_id_1",
        "dropDups" : true,
        "safe" : true
    }
]

Comment by Jeremy Mikola [ 09/Apr/13 ]

Also, in your earlier steps for recreating the replica set, you mentioned dropping the database. In that case, the local database (with its oplog) may have remained in place from the previous replica set, which could cause strange behavior. It would be preferable to wipe the entire database directory when recreating a replica set member.
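
A quick way to check for leftover replica set state on a node that is being reused (a sketch; run it against the node while it is up as a standalone):

use local
show collections                      // oplog.rs and system.replset indicate old replica set state
db.system.replset.find().pretty()     // the previous replica set configuration, if any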

Comment by Jeremy Mikola [ 09/Apr/13 ]

Grepping for "writer worker caught exception" across those log files turned up the following in order:

  • Primary
    • Tue Apr 9 07:16:41.864: E11000 duplicate key error index: wowfeed.registered
    • Tue Apr 9 07:30:01.158: E11000 duplicate key error index: wowfeed.registered
    • Tue Apr 9 07:37:31.325: E11000 duplicate key error index: wowfeed.registered
  • Secondary 1
    • Tue Apr 9 08:16:26.670: E11000 duplicate key error index: wowfeed.armory_items.$item_id_1
    • Tue Apr 9 09:34:43.378: E11000 duplicate key error index: wowfeed.armory_items.$item_id_1
    • Tue Apr 9 10:01:44.490: E11000 duplicate key error index: wowfeed.armory_items.$item_id_1
    • Tue Apr 9 19:04:31.795: E11000 duplicate key error index: wowfeed.registered.$name_1_realm_1_region_1_type_1
  • Secondary 2 (most recent failures)
    • Tue Apr 9 11:40:15.143: E11000 duplicate key error index: wowfeed.armory_items
    • Tue Apr 9 20:34:41.993: E11000 duplicate key error index: wowfeed.registered
    • Tue Apr 9 20:37:55.917: E11000 duplicate key error index: wowfeed.registered

There were also a few "interrupted at shutdown" errors (during insertions to wowfeed.armory_calls) on secondary 2, but they were from last week and I trimmed them.

Can you confirm once again that the indexes for wowfeed.armory_items and wowfeed.registered are consistent across all members in the replica set? Based on an earlier comment you made, not all servers had the same unique indexes defined.

To your earlier question about indexing, MongoDB does replicate index creation across a replica set, but there is also a suggested procedure (see: here) for manually creating indexes on secondaries to avoid impacting performance.
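
The gist of that procedure, sketched for this deployment (assumptions: the member is temporarily restarted as a standalone, e.g. on another port and without its replSet option, before the build):

// on the secondary, restarted as a standalone:
db.getSiblingDB("wowfeed").registered.ensureIndex(
    { name: 1, realm: 1, region: 1, type: 1 },
    { unique: true }
)
// then restart it with its normal replSet options so it rejoins and catches up;
// repeat per secondary, then rs.stepDown() on the primary and do the same there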

3 (of 4) were down except for the primary, but it was acting as secondary due to lack of quorum.

Regarding "of 4", does your replica have an even number of members?

Comment by Chris Kimpton [ 09/Apr/13 ]

server logs

Comment by Chris Kimpton [ 09/Apr/13 ]

Hi Scott,

Thanks for re-opening the ticket - apologies again for hassling you on Twitter - that was a bit cheeky; I wasn't sure if you'd get messages on a closed item...

I rebuilt the set as follows:

  • 3 (of 4) were down except for the primary, but it was acting as secondary due to lack of quorum.
  • I stopped the app, then logged into the mongo shell and set up a new config for the set with only one member
  • applied/forced the new config: rs.reconfig(cfg2, { force: true }) - see the consolidated shell sketch after this list

  • (1) connected to the app db: use wowfeed

  • dropped db: db.dropDatabase()
  • reconnected to app db: use wowfeed
  • add db user: db.addUser('...','....') - security is turned on
  • started up the app and used it a bit, and checked that the indexes get set up correctly. Seemed to be working, correct indexes created on all collections, etc.
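
A consolidated sketch of those shell steps (the member index, user name and password are placeholders, not the actual values used):

cfg2 = rs.conf()
cfg2.members = [cfg2.members[0]]        // keep only the surviving member (index 0 assumed here)
rs.reconfig(cfg2, { force: true })      // force, since there is no majority to vote
use wowfeed
db.dropDatabase()
use wowfeed
db.addUser('appUser', 'appPassword')    // auth is enabled on this deployment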

So, I added the first secondary, as follows:

  • log onto the box, amend /etc/mongodb.conf so the db is not part of the replica set
  • start up mongodb: service mongodb start
  • as above from point (1): connected to the db, dropped the db, re-added the db user
  • then amended /etc/mongodb.conf back the way it was and restarted mongodb.

Then in the mongo shell against the primary, I added the node: rs.add('....')

I waited a bit for it to settle down - seems to be running ok.

For the next (problem) node, I did the same as above: take it out of the replica set, drop the db, re-add it.

Then it died.

Logs from all 3 db's will be attached.

Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ]

Can you attach the logs from each node please?

Also, how are you building this replica set? Please list each step taken so we can accurately account for each part of the process.

Comment by Chris Kimpton [ 09/Apr/13 ]

On the problem node, the registered indexes look like below. Shouldn't the MongoDB replication process make sure the indexes are in place on the replicas before adding data?

> db.registered.getIndexes()
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    }
]

compared to this on primary:

> db.registered.getIndexes()
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "name" : 1, "realm" : 1, "region" : 1, "type" : 1 },
        "unique" : true,
        "ns" : "wowfeed.registered",
        "name" : "name_1_realm_1_region_1_type_1",
        "safe" : true
    },
    {
        "v" : 1,
        "key" : { "updated_at" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "updated_at_1",
        "expireAfterSeconds" : 604800,
        "safe" : true
    }
]
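
To the question above: in 2.4 an index build does replicate, as an insert into <db>.system.indexes in the oplog, and a secondary builds the index when it applies that entry - the crash log later in this ticket shows exactly such an entry being applied. A hedged way to inspect those entries from the shell, assuming the default oplog collection local.oplog.rs:

use local
db.oplog.rs.find({ op: "i", ns: "wowfeed.system.indexes" }).sort({ $natural: -1 }).pretty()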

Comment by Chris Kimpton [ 09/Apr/13 ]

It's gone again, while adding the 3rd node to the replica set:

Tue Apr 9 20:37:51.856 [conn7] end connection 5.9.133.180:47460 (1 connection now open)
Tue Apr 9 20:37:53.478 [rsHealthPoll] replset info allyourvps1:27017 thinks that we are down
Tue Apr 9 20:37:53.877 [initandlisten] connection accepted from 5.9.133.180:47461 #8 (2 connections now open)
Tue Apr 9 20:37:55.915 [repl writer worker 1] build index wowfeed.registered { name: 1, realm: 1, region: 1, type: 1 }
Tue Apr 9 20:37:55.917 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: wowfeed.registered.$name_1_realm_1_region_1_type_1 dup key: { : null, : null, : null, : null } on: { ts: Timestamp 1365530669000|2, h: -7971887077277862493, v: 2, op: "i", ns: "wowfeed.system.indexes", o: { ns: "wowfeed.registered", key: { name: 1, realm: 1, region: 1, type: 1 }, name: "name_1_realm_1_region_1_type_1", unique: true, safe: true } }
Tue Apr 9 20:37:55.917 [repl writer worker 1] Fatal Assertion 16360
0xdc7f71 0xd87cf3 0xc1b29c 0xd95821 0xe10879 0x7f8e4e16be9a 0x7f8e4d47ecbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdc7f71]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd87cf3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc1b29c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd95821]
/usr/bin/mongod() [0xe10879]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f8e4e16be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8e4d47ecbd]
Tue Apr 9 20:37:55.919 [repl writer worker 1]

***aborting after fassert() failure

Tue Apr 9 20:37:55.919 Got signal: 6 (Aborted).

Tue Apr 9 20:37:55.921 Backtrace:
0xdc7f71 0x6ce459 0x7f8e4d3c14a0 0x7f8e4d3c1425 0x7f8e4d3c4b8b 0xd87d2e 0xc1b29c 0xd95821 0xe10879 0x7f8e4e16be9a 0x7f8e4d47ecbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdc7f71]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6ce459]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f8e4d3c14a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f8e4d3c1425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f8e4d3c4b8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd87d2e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc1b29c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd95821]
/usr/bin/mongod() [0xe10879]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f8e4e16be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f8e4d47ecbd]

Comment by Chris Kimpton [ 09/Apr/13 ]

FWIW, I have been using 2.4.1 for a few weeks without an obvious problem, but connecting directly to the primary. Last night/this morning I finally got my code connecting to the replica set, and that's when these problems started - or at least became frequent enough to affect things.

Comment by Chris Kimpton [ 09/Apr/13 ]

Hi Scott,

Good point - I presume it's missing from all nodes.

I see that the nodes have gone again, see below.

I think it's time to start from scratch.

Cheers,
Chris

Tue Apr 9 11:40:15.143 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: wowfeed.armory_items.$item_id_1 dup key: { : 91145 } on: { ts: Timestamp 1365504011000|23, h: 1374244527477853697, v: 2, op: "i", ns: "wowfeed.armory_items", o: { id: 91145, disenchantingSkillRank: 575, description: "", name: "Tyrannical Gladiator's Heavy Crossbow", icon: "inv_bow_2h_crossbow_pvppandarias2_c_01", stackable: 1, allowableRaces: [ 7, 1, 11, 22, 4, 3, 25 ], itemBind: 1, bonusStats: [ { stat: 35, amount: 793 }, { stat: 57, amount: 5912 }, { stat: 32, amount: 793 }, { stat: 3, amount: 1190 }, { stat: 7, amount: 1784 } ], itemSpells: {}, buyPrice: 1754796, itemClass: 2, itemSubClass: 18, containerSlots: 0, weaponInfo: { damage: { min: 10427, max: 15643, exactMin: 10427.985, exactMax: 15642 }, weaponSpeed: 3, dps: 4344.9976 }, inventoryType: 26, equippable: true, itemLevel: 493, maxCount: 0, maxDurability: 120, minFactionId: 0, minReputation: 0, quality: 4, sellPrice: 350959, requiredSkill: 0, requiredLevel: 90, requiredSkillRank: 0, itemSource: { sourceId: 69967, sourceType: "VENDOR" }, baseArmor: 0, hasSockets: false, isAuctionable: false, armor: 0, displayInfoId: 119143, nameDescription: "Season 13", nameDescriptionColor: "00ff00", upgradable: true, heroicTooltip: false, item_id: 91145, _id: ObjectId('5163f00b8ad60bf35e00007b') } }
Tue Apr 9 11:40:15.143 [repl writer worker 1] Fatal Assertion 16360
0xdc7f71 0xd87cf3 0xc1b29c 0xd95821 0xe10879 0x7f0874a90e9a 0x7f0873da3cbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdc7f71]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd87cf3]

Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ]

Sorry to hear that.

It looks like the offending unique index:

{ name: 1, realm: 1, region: 1, type: 1 }

is missing from that list. It was either removed or is somehow inconsistent across the members of the replica set.

In MMS you can see a graph of the replication lag, so you can tell if you get into that situation again. As you have seen, the state (SECONDARY) doesn't reflect how far behind a member is.
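
Outside MMS, the lag can also be checked from the shell (a sketch; the output format varies slightly by version):

rs.printSlaveReplicationInfo()    // per-member lag behind the primary's oplog
rs.status().members.forEach(function (m) {
    print(m.name, m.stateStr, m.optimeDate)
})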

Comment by Chris Kimpton [ 09/Apr/13 ]

Hi Scott,

Hmm - probably not worth printing indexes from all nodes, as since I posted this all nodes have crashed and I have rebuilt them all - but it seems I did it too quickly, without enough time to sync up, and so all the data was lost. Foolishly, when I saw the stateStr was SECONDARY, I thought it meant it was in sync - but I guess that is not the case.

The 'registered' collection indexes on the current primary are below. They are built/checked by the app each time it accesses the collection - so not built manually on the secondaries.

Also, no backups are available - that's another thing I have been trying to set up (I actually posted an item on the Google group about a crash I was getting with mongodump).

For my app, losing the data is not too bad - no paying costs/money lost - it's a toy Warcraft-related site. As people query it, it will re-query the Warcraft API and reload the DB.

I have also added MMS monitoring - although it only seems to be checking the primary so far... not sure if I need to do something for it to pick up the secondaries. One of them seems to have dropped off like it did last week too.

Regards,
Chris

wowfeed:PRIMARY> db.registered.getIndexes()
[
    {
        "v" : 1,
        "key" : { "_id" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "_id_"
    },
    {
        "v" : 1,
        "key" : { "updated_at" : 1 },
        "ns" : "wowfeed.registered",
        "name" : "updated_at_1",
        "expireAfterSeconds" : 604800,
        "safe" : true
    }
]

Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ]

Can you post the indexes for your "registered" collection in the "wowfeed" database from all your nodes?

> use wowfeed
> db.registered.getIndexes()

Unfortunately, without the data from the primary it is hard to tell, but the error suggests the unique index constraint wasn't on that node. How did you build the indexes? Is it possible that you built the indexes manually and directly on the secondaries, separately from the primary?

Do you have a backup from the primary node which contains an earlier version of the data which would contain the index definitions?

Comment by Chris Kimpton [ 09/Apr/13 ]

To work around this, I removed the node from the replica set, dropped the db, and then reconnected it to the replica set.
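
In shell terms, that workaround looks roughly like this (the hostname is a placeholder; as noted elsewhere in the ticket, wiping the data directory as well is the safer variant):

rs.remove("problem-node.example.net:27017")   // on the primary
// on the problem node, restarted outside the replica set:
use wowfeed
db.dropDatabase()
// then, back on the primary:
rs.add("problem-node.example.net:27017")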

Seems ok now...

Comment by Chris Kimpton [ 09/Apr/13 ]

Initial log with crash when node stepped down
