[SERVER-16488] Fatal Assertion 16967 during normal operation and repair
| Created: | 10/Dec/14 | Updated: | 22/Jan/15 | Resolved: | 22/Jan/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Stability, Storage |
| Affects Version/s: | 2.6.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Nick Sturrock | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Ubuntu LTS 12.04 64-bit |
| Attachments: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
I am currently recovering the primary node of a replica set following a disk fault. Having run the disk repair and restarted mongodb I am experiencing a crash with 'Fatal Assertion 16967'. Attempting to run a repair gives the same error. Full stack trace is below:

2014-12-10T09:54:26.453+0000 [initandlisten] buzzdeck.feed Fatal Assertion 16967
***aborting after fassert() failure
2014-12-10T09:54:26.490+0000 [initandlisten] SEVERE: Got signal: 6 (Aborted). |
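For context, the repair attempt mentioned above is normally run against a stopped node's data directory. A minimal sketch, assuming a stock dbpath and log path (both placeholders, not taken from this ticket):

    # stop the mongod process first, then run a repair pass over the data files
    # /data/db and the log path are placeholder values
    mongod --dbpath /data/db --repair --logpath /var/log/mongodb/repair.log

In this case the repair pass hits the same fassert 16967, which is why the comments below steer towards resyncing from another member rather than repairing in place.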
| Comments |
| Comment by Ramon Fernandez Marina [ 22/Jan/15 ] |
|
nick.sturrock, we haven't heard back from you for a while, so I assume you were able to resync your secondaries by one of the means linked above. I'm now resolving this ticket, but feel free to reopen if this issue surfaces again. Regards, |
| Comment by Ramon Fernandez Marina [ 11/Dec/14 ] |
|
nick.sturrock, if initial sync is not working for you there are other methods to resync a replica set member, like copying the data files directly. If you want to try this approach I'd recommend you read about backup methods first. Note that if the database files in the source node contain data corruption you may run into issues later on, so you may want to consider recovering from your latest backup to make sure your dataset is healthy. |
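For illustration, the copy-data-files approach described here typically looks something like the following on the member being rebuilt; the service name, source hostname, and paths are placeholders, not taken from this ticket:

    # stop mongod on the member being rebuilt and clear its data directory
    sudo service mongodb stop
    rm -rf /data/db/*
    # copy the data files from a healthy source node while its mongod is stopped,
    # or from a consistent filesystem snapshot / backup
    rsync -av healthy-node:/data/db/ /data/db/
    # restart the member; it rejoins the replica set and catches up from the oplog
    sudo service mongodb start

As noted above, if the source files themselves are corrupt the copy inherits that corruption, which is why restoring from a known-good backup may be the safer option.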
| Comment by Nick Sturrock [ 11/Dec/14 ] |
|
Sadly the secondary was in the middle of a full resync when this problem occurred. The primary node became totally unresponsive but didn't crash, so we had to do a manual reset, which caused disk errors (and no doubt corruption in the data set). The secondary fell too far behind to catch up - at which point we should perhaps have made it the primary and suffered the data loss from the downtime - but instead we started a full resync, so it's not in a good state. So we're currently running with a single flaky node that goes down for 2-3 seconds every 40 minutes or so - it seems this is enough to stop the resync from completing, since the resync restarts every time the primary node gets restarted. Is there a way to make the resync resume from where it left off? |
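One way to judge whether a lagging secondary can still catch up is to compare the primary's oplog window with the secondary's lag; a rough check from the shell, assuming default connection settings:

    # on the primary: oplog size and the time range it currently covers
    mongo --eval "db.printReplicationInfo()"
    # lag of each secondary relative to the primary
    mongo --eval "db.printSlaveReplicationInfo()"

As far as the 2.6 series goes, an interrupted initial sync does not resume; it starts again from the beginning, so each restart of the sync source resets it.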
| Comment by Daniel Pasette (Inactive) [ 10/Dec/14 ] |
|
If you have a healthy secondary, you should do a fresh resync off that node rather than trying to repair the primary. |
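For reference, the 'fresh resync' suggested here usually means stopping the member, emptying its data directory, and letting it perform an automatic initial sync on restart; a sketch with a placeholder service name and dbpath:

    # on the member to be resynced (placeholder paths)
    sudo service mongodb stop
    rm -rf /data/db/*
    # on startup the member runs an initial sync from another member of the set
    sudo service mongodb start

The catch, as discussed elsewhere in this ticket, is that the initial sync has to run to completion without the sync source going down.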
| Comment by Nick Sturrock [ 10/Dec/14 ] |
|
Attached is a level 3 diagnostic log taken in normal operation right up to the crash point - not sure if this is useful or not, but it's there if you want it. |
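For anyone reproducing this, a verbosity level 3 log like the attached one can be captured by raising the server log level, for example at runtime from the shell (connection details assumed to be defaults):

    # raise server log verbosity to level 3; set it back to 0 when finished
    mongo --eval "db.adminCommand({ setParameter: 1, logLevel: 3 })"

Starting mongod with -vvv achieves the same verbosity from startup.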
| Comment by Nick Sturrock [ 10/Dec/14 ] |
|
Tried to assign this to release 2.6.5, but it apparently hasn't been set on the ticket. |