[SERVER-9293] Old primary crashed after stepDown - fatal assertion 16360 Created: 09/Apr/13 Updated: 10/Dec/14 Resolved: 10/Apr/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Chris Kimpton | Assignee: | Jeremy Mikola |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Linux/Ubuntu 12.04/OpenVZ | ||
| Attachments: | |
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi, When I did a stepDown on the primary, it failed over to another node OK, but the old primary crashed. When I tried to restart it, it also crashed. Regards, |
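For reference, a minimal sketch of the stepDown sequence described above, run from the mongo shell on the primary (the 60-second value is illustrative and not taken from this report):

    wowfeed:PRIMARY> rs.stepDown(60)     // step down and refuse re-election for ~60 seconds
    // the shell reconnects; the prompt should now show SECONDARY
    wowfeed:SECONDARY> rs.status()       // confirm another member has been elected PRIMARY

After a successful stepDown the old primary normally just drops to SECONDARY rather than aborting, which is why the crash was reported here.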
| Comments |
| Comment by Chris Kimpton [ 10/Apr/13 ] | ||
|
Hi Jeremy, Sure, that's fine. On my wish list would be better validation of the local db when joining a replica set, to avoid the issues I had - but I appreciate that I should have been more careful. Regards, | ||
| Comment by Jeremy Mikola [ 10/Apr/13 ] | ||
|
Chris, Glad to hear it. I'm going to resolve this, as there wasn't any indication of a bug at play and the inconsistent local DB was the likely culprit. | ||
| Comment by Chris Kimpton [ 10/Apr/13 ] | ||
|
Hi Jeremy, OK - that seems to work better. I have now wiped the data directory and re-added the node to the replica set. It's caught up, with none of the strangeness of yesterday. Many thanks for the help. Regards, | ||
| Comment by Jeremy Mikola [ 10/Apr/13 ] | ||
You should wipe the data directory when creating a replica set member from scratch. The local database stores meta information for the replica set (e.g. configuration, oplog), so leaving that in place may cause unexpected results in the new replica set. | ||
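A minimal sketch of the resync described here, assuming the default Ubuntu package layout (service name mongodb, dbpath /var/lib/mongodb) and a placeholder hostname - adjust to your own setup:

    # on the member being rebuilt
    $ sudo service mongodb stop
    $ rm -rf /var/lib/mongodb/*        # wipes all databases, including 'local' (replica set config + oplog)
    $ sudo service mongodb start       # the node comes back empty and performs an initial sync

    # on the current primary, only if the member was removed from the config earlier
    wowfeed:PRIMARY> rs.add("problem-node.example.com:27017")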
| Comment by Chris Kimpton [ 10/Apr/13 ] | ||
|
As a test of the dodgy node, I have set up a standalone mongod and a copy of the app on the problem node, and will see if it hits any issues on its own. It started OK and set up the collections with the correct indexes. | ||
| Comment by Chris Kimpton [ 10/Apr/13 ] | ||
|
Hi Jeremy, The indexes are created by the app, on the primary - I am NOT currently creating anything manually on the secondaries. Index details are listed at the bottom of this message. As the data I have is quite small (it was several gig, but the DB was practically empty when I got the last crash), I wouldn't think I need to pre-create the indexes.
I have checked the logs across all servers for "writer worker caught exception" and I don't see any new such errors... but then it depends on the data the app processes.
Yes - I had/have a replica set of 4 members... I guess I should not be doing that. I have added another secondary - so now I have 3 working members and the problem one. As I have had issues with this node and mongo before, I am thinking maybe there is a network issue with it - so I have raised a call with my service provider.
And yes, I did not drop the local db - what should I have done - completely emptied the /var/lib/mongodb directory?
Indexes from what was the primary:
> db.registered.getIndexes(); , , , , ,
secondary, as was, but now primary:
> db.registered.getIndexes(); , , ,
> db.armory_items.getIndexes(); , ,
crashed secondary, now out of replset:
> db.registered.getIndexes(); ,
> db.armory_items.getIndexes(); ,
other secondary I added later:
> db.registered.getIndexes(); , , ,
> db.armory_items.getIndexes(); , , | ||
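Since the thread keeps coming back to whether the indexes match across members, here is a rough shell helper for gathering them in one place; the host list is a placeholder, and the collection names are the ones used in this ticket:

    // run from any mongo shell that can reach all members
    var hosts = ["node1:27017", "node2:27017", "node3:27017", "node4:27017"];
    hosts.forEach(function(h) {
        var conn = new Mongo(h);
        conn.setSlaveOk();                       // allow reads from secondaries
        var d = conn.getDB("wowfeed");
        print("=== " + h + " ===");
        printjson(d.registered.getIndexes());
        printjson(d.armory_items.getIndexes());
    });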
| Comment by Jeremy Mikola [ 09/Apr/13 ] | ||
|
Also, in your earlier steps for recreating the replica set, you mentioned dropping the database. In that case, the local database (with its oplog) may have remained in place from the previous replica set, which could cause strange behavior. It would be preferable to wipe the entire database directory when recreating a replica set member. | ||
| Comment by Jeremy Mikola [ 09/Apr/13 ] | ||
|
Grepping for "writer worker caught exception" across those log files turned up the following in order:
There were also a few "interrupted at shutdown" errors (during insertions to wowfeed.armory_calls) on secondary 2, but they were from last week and I trimmed them. Can you confirm once again that the indexes for wowfeed.armory_items and wowfeed.registered are consistent across all members in the replica set? Based on an earlier comment you made, not all servers had the same unique indexes defined. To your earlier question about indexing, MongoDB does replicate index creation across a replica set, but there is also a suggested procedure (see: here) for manually creating indexes on secondaries to avoid impacting performance.
Regarding "of 4", does your replica have an even number of members? | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
server logs | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
Hi Scott, Thanks for re-opening the ticket - apologies again for hassling you on Twitter - that was a bit cheeky; I wasn't sure if you'd get messages on a closed item... I rebuilt the set as follows:
-(1) connected to app db: use wowfeed
So, I added the first secondary, as follows:
Then, in the mongo shell against the primary, I added the node: rs.add('....'). I waited a bit for it to settle down - it seems to be running OK. For the next, problem node, I did the same as above: take it out of the replset, drop the db, re-add it. Then it died. Logs from all 3 DBs will be attached. | ||
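When adding members one at a time like this, it can help to confirm each one has reached SECONDARY and caught up before adding the next; a quick check from the primary's shell (field names as returned by the 2.4 shell):

    wowfeed:PRIMARY> rs.status().members.forEach(function(m) { print(m.name + "  " + m.stateStr + "  " + m.optimeDate); })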
| Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ] | ||
|
Can you attach the logs from each node please? Also, how are you building this replica set? Please list each step taken so we can accurately account for each part of the process. | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
On the problem node, the registered indexes look like below. Shouldn't the MongoDB replication process make sure the indexes are in place on the replicas before adding data?
> db.registered.getIndexes()
,
compared to this on the primary:
> db.registered.getIndexes()
, , , | ||
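To the question above: index builds are not re-checked per member; they replicate like any other write. Creating the unique index on the primary (the key and options below are taken from the crash log in this ticket) inserts a document into wowfeed.system.indexes, and that insert is what each secondary replays from the oplog:

    wowfeed:PRIMARY> db.registered.ensureIndex({ name: 1, realm: 1, region: 1, type: 1 }, { unique: true })

If a secondary already holds documents that violate that constraint (for example several documents where all four fields are missing or null), replaying the index build fails with an E11000 error and the member fasserts, which is consistent with the aborts in the attached logs.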
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
It's gone again, while adding the 3rd node to the replica set:
Tue Apr 9 20:37:51.856 [conn7] end connection 5.9.133.180:47460 (1 connection now open)
Tue Apr 9 20:37:55.917 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: wowfeed.registered.$name_1_realm_1_region_1_type_1 dup key: { : null, : null, : null, : null } on: { ts: Timestamp 1365530669000|2, h: -7971887077277862493, v: 2, op: "i", ns: "wowfeed.system.indexes", o: { ns: "wowfeed.registered", key: { name: 1, realm: 1, region: 1, type: 1 }, name: "name_1_realm_1_region_1_type_1", unique: true, safe: true } }
***aborting after fassert() failure
Tue Apr 9 20:37:55.919 Got signal: 6 (Aborted).
Tue Apr 9 20:37:55.921 Backtrace: | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
FWIW, I have been using 2.4.1 for a few weeks without an obvious problem, but connecting directly to the primary - last night/this morning I finally got my code connecting to the replica set, and that's when these problems started. | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
Hi Scott, Good point - I presume it's missing from all nodes. I see that the nodes have gone again, see below. I think it's time to start from scratch. Cheers,
Tue Apr 9 11:40:15.143 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: wowfeed.armory_items.$item_id_1 dup key: { : 91145 } on: { ts: Timestamp , { stat: 57, amount: 5912 }, { stat: 32, amount: 793 }, { stat: 3, amount: 1190 }, { stat: 7, amount: 1784 }], itemSpells: {}, buyPrice: 1754796, itemClass: 2, itemSubClass: 18, containerSlots: 0, weaponInfo: { damage: { min: 10427, max: 15643, exactMin: 10427.985, exactMax: 15642 }, weaponSpeed: 3, dps: 4344.9976 }, inventoryType: 26, equippable: true, itemLevel: 493, maxCount: 0, maxDurability: 120, minFactionId: 0, minReputation: 0, quality: 4, sellPrice: 350959, requiredSkill: 0, requiredLevel: 90, requiredSkillRank: 0, itemSource: { sourceId: 69967, sourceType: "VENDOR" }, baseArmor: 0, hasSockets: false, isAuctionable: false, armor: 0, displayInfoId: 119143, nameDescription: "Season 13", nameDescriptionColor: "00ff00", upgradable: true, heroicTooltip: false, item_id: 91145, _id: ObjectId('5163f00b8ad60bf35e00007b') } } | ||
| Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ] | ||
|
Sorry to hear that. It looks like the offending unique index:
is missing in that list. It looks like it was either removed or is somehow inconsistent across members of the replica set. In MMS you can see a graph of the replication lag, so you can tell if you get into that situation again. As you know, the state (SECONDARY) doesn't reflect how far behind a member is. | ||
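Alongside the MMS graph, replication lag can also be eyeballed from the shell with the standard helpers (available in the 2.4 shell):

    wowfeed:PRIMARY> rs.printSlaveReplicationInfo()   // per-secondary: how far behind the primary's oplog
    wowfeed:PRIMARY> rs.status()                      // compare optimeDate across members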
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
Hi Scott, Hmm - probably not worth printing the indexes from all nodes, as since I posted this all nodes have crashed and I have rebuilt them all - but it seems I did it too quickly, without enough time to sync up, and so all the data was lost.
The 'registered' collection indexes on the current primary are below. They are built/checked by the app each time it accesses the collection - so not built manually on the secondaries. Also, no backups are available - that's another thing I have been trying to set up (I actually posted an item on the Google group about a crash I was getting with mongodump).
For my app, losing the data is not too bad - no paying costs/money lost - it's a toy Warcraft-related site. As people query it, it will re-query the Warcraft API and reload the DB.
I have also added MMS monitoring - although it only seems to be checking the primary so far... not sure if I need to do something for it to pick up the secondaries. One of them seems to have dropped off like it did last week too. Regards,
wowfeed:PRIMARY> db.registered.getIndexes()
, , | ||
| Comment by Scott Hernandez (Inactive) [ 09/Apr/13 ] | ||
|
Can you post the indexes for your "registered" collection in the "wowfeed" database from all your nodes?
Unfortunately, without the data from the primary it is hard to tell, but the error suggests the unique index constraint wasn't present on that node. How did you build the indexes? Is it possible that you built them manually and directly on the secondaries, separately from the primary? Do you have a backup from the primary node containing an earlier version of the data that would include the index definitions? | ||
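On the backup question - a basic dump of the one database involved would be enough to preserve the index definitions for later comparison. A sketch, with the host and output path as placeholders (in 2.4 index definitions are captured via system.indexes):

    $ mongodump --host primary-host --port 27017 --db wowfeed --out /backups/wowfeed-$(date +%F)
    # restore later with: mongorestore --db wowfeed /backups/wowfeed-<date>/wowfeed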
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
To work around this, I removed the node from the replica set, dropped the db and then reconnected it to the replset. Seems OK now... | ||
| Comment by Chris Kimpton [ 09/Apr/13 ] | ||
|
Initial log with crash when node stepped down |