[SERVER-16488] Fatal Assertion 16967 during normal operation and repair Created: 10/Dec/14  Updated: 22/Jan/15  Resolved: 22/Jan/15

Status: Closed
Project: Core Server
Component/s: Stability, Storage
Affects Version/s: 2.6.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nick Sturrock Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 12.04 LTS, 64-bit


Attachments: diaglog.54886b00
Operating System: Linux

 Description   

I am currently recovering the primary node of a replica set following a disk fault. Having run the disk repair and restarted MongoDB, I am experiencing a crash with 'Fatal Assertion 16967'. Attempting to run a repair gives the same error (a sketch of the repair invocation follows the trace). The full stack trace is below:

2014-12-10T09:54:26.453+0000 [initandlisten] buzzdeck.feed Fatal Assertion 16967
2014-12-10T09:54:26.484+0000 [initandlisten] buzzdeck.feed 0x11e9b11 0x118b849 0x116e37d 0xefb92d 0xefba94 0xf34cd7 0xe0509f 0x767fae 0x76ac5e 0x76c89f 0x76d14b 0x76d6e5 0x76d909 0x7fd9cd48276d 0x764589
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e9b11]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x118b849]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x116e37d]
/usr/bin/mongod() [0xefb92d]
/usr/bin/mongod(_ZNK5mongo13ExtentManager13getNextRecordERKNS_7DiskLocE+0x24) [0xefba94]
/usr/bin/mongod(_ZN5mongo12FlatIterator7getNextEv+0x97) [0xf34cd7]
/usr/bin/mongod(_ZN5mongo14repairDatabaseESsbb+0x24cf) [0xe0509f]
/usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsPNS_14DataFileHeaderE+0x5e) [0x767fae]
/usr/bin/mongod() [0x76ac5e]
/usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x5df) [0x76c89f]
/usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1b) [0x76d14b]
/usr/bin/mongod() [0x76d6e5]
/usr/bin/mongod(main+0x9) [0x76d909]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fd9cd48276d]
/usr/bin/mongod() [0x764589]
2014-12-10T09:54:26.484+0000 [initandlisten]

***aborting after fassert() failure

2014-12-10T09:54:26.490+0000 [initandlisten] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11e9b11 0x11e8eee 0x7fd9cd4974a0 0x7fd9cd497425 0x7fd9cd49ab8b 0x116e3ea 0xefb92d 0xefba94 0xf34cd7 0xe0509f 0x767fae 0x76ac5e 0x76c89f 0x76d14b 0x76d6e5 0x76d909 0x7fd9cd48276d 0x764589
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11e9b11]
/usr/bin/mongod() [0x11e8eee]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fd9cd4974a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fd9cd497425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fd9cd49ab8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x116e3ea]
/usr/bin/mongod() [0xefb92d]
/usr/bin/mongod(_ZNK5mongo13ExtentManager13getNextRecordERKNS_7DiskLocE+0x24) [0xefba94]
/usr/bin/mongod(_ZN5mongo12FlatIterator7getNextEv+0x97) [0xf34cd7]
/usr/bin/mongod(_ZN5mongo14repairDatabaseESsbb+0x24cf) [0xe0509f]
/usr/bin/mongod(_ZN5mongo11doDBUpgradeERKSsPNS_14DataFileHeaderE+0x5e) [0x767fae]
/usr/bin/mongod() [0x76ac5e]
/usr/bin/mongod(_ZN5mongo14_initAndListenEi+0x5df) [0x76c89f]
/usr/bin/mongod(_ZN5mongo13initAndListenEi+0x1b) [0x76d14b]
/usr/bin/mongod() [0x76d6e5]
/usr/bin/mongod(main+0x9) [0x76d909]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xed) [0x7fd9cd48276d]
/usr/bin/mongod() [0x764589]
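
For reference, a standalone repair pass like the one attempted above is normally run with mongod stopped; a minimal sketch, assuming a dbpath of /var/lib/mongodb (both paths below are placeholders for this deployment):

# Run a repair pass against the data files with the server stopped.
# --repairpath writes the rebuilt files to a separate directory so the
# originals are left untouched (needs enough free disk for a full copy).
mongod --dbpath /var/lib/mongodb --repair --repairpath /var/lib/mongodb.repair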



 Comments   
Comment by Ramon Fernandez Marina [ 22/Jan/15 ]

nick.sturrock, we haven't heard back from you for a while, so I assume you were able to resync your secondaries via one of the methods linked above. I'm now resolving this ticket, but feel free to reopen it if this issue surfaces again.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 11/Dec/14 ]

nick.sturrock, if initial sync is not working for you, there are other methods to resync a replica set member, such as copying the data files directly (sketched below). If you want to try this approach, I'd recommend reading about backup methods first.

Note that if the database files on the source node contain data corruption you may run into issues later on, so you may want to consider recovering from your latest backup to make sure your dataset is healthy.
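
In outline, the copy-the-data-files approach looks like this; a rough sketch, assuming a dbpath of /var/lib/mongodb on both hosts (the host name and paths are placeholders):

# On the healthy source node: flush all writes to disk and block new
# ones for the duration of the copy (or cleanly shut the node down).
mongo --eval "db.fsyncLock()"

# Copy the data files to the member being rebuilt.
rsync -av /var/lib/mongodb/ target-host:/var/lib/mongodb/

# Unlock the source node once the copy has completed.
mongo --eval "db.fsyncUnlock()"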

Comment by Nick Sturrock [ 11/Dec/14 ]

Sadly the secondary was in the middle of a full resync when this problem occurred. The primary node became totally unresponsive but didn't crash, so we had to do a manual reset, which caused disk errors (and no doubt corruption in the data set). The secondary fell too far behind to catch up. At that point we should perhaps have made it the primary and accepted the data loss from the downtime, but instead we started a full resync, so it's not in a good state. We're currently running with a single flaky node that goes down for 2-3 seconds every 40 minutes or so; that seems to be enough to stop the resync from completing, since it restarts every time the primary node gets restarted. Is there a way to make the resync resume from where it left off?

Comment by Daniel Pasette (Inactive) [ 10/Dec/14 ]

If you have a healthy secondary, you should do a fresh resync off that node rather than trying to repair the primary (see the sketch below).
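
A fresh resync amounts to restarting the member with an empty data directory; a minimal sketch, assuming a dbpath of /var/lib/mongodb (the service name depends on how mongod was installed):

# 1. Stop the member that needs to be resynced.
sudo service mongodb stop

# 2. Empty its data directory; an empty dbpath causes the node to
#    perform an initial sync from another member when it rejoins.
rm -rf /var/lib/mongodb/*

# 3. Restart the member and let the initial sync run to completion.
sudo service mongodb start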

Comment by Nick Sturrock [ 10/Dec/14 ]

Attached is a level 3 diagnostic log taken in normal operation right up to the crash point. Not sure if it's useful, but it's there if you want it.

Comment by Nick Sturrock [ 10/Dec/14 ]

I tried to set the affected version to 2.6.5, but it doesn't appear to have been recorded on the ticket.
