[SERVER-16629] primary server fassert() with fatal assertion 16967 Created: 22/Dec/14  Updated: 09/Apr/15  Resolved: 02/Mar/15

Status: Closed
Project: Core Server
Component/s: Stability
Affects Version/s: 2.6.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Boris HUISGEN Assignee: Ramon Fernandez Marina
Resolution: Incomplete Votes: 0
Labels: crash, replicaset, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian 7.7 / MongoDB 2.6.6 from 10gen repository


Operating System: Linux
Participants:

 Description   

We have a replica set of three MongoDB 2.6.6 servers.

Yesterday the primary server crashed with this assertion:

Dec 21 21:27:21 localhost mongod.27017[2702]: [TTLMonitor] prod_front.sessions Fatal Assertion 16967
Dec 21 21:27:21 localhost mongod.27017[2702]: [TTLMonitor] prod_front.sessions 0x11fca91 0x119e889 0x11813bd 0xefd5ed 0xefd754 0xf3c972 0x8b8566 0xc1ccfa 0xc1bcce 0xf3f772 0xf41781 0x1184572 0x1241429 0x7f880b19db50 0x7f880a5407bd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11fca91]
/usr/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x119e889]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x11813bd]
/usr/bin/mongod() [0xefd5ed]
/usr/bin/mongod(_ZNK5mongo13ExtentManager13getNextRecordERKNS_7DiskLocE+0x24) [0xefd754]
/usr/bin/mongod(_ZN5mongo17RecordStoreV1Base12deleteRecordERKNS_7DiskLocE+0xb2) [0xf3c972]
/usr/bin/mongod(_ZN5mongo10Collection14deleteDocumentERKNS_7DiskLocEbbPNS_7BSONObjE+0x636) [0x8b8566]
/usr/bin/mongod(_ZN5mongo14DeleteExecutor7executeEv+0x9da) [0xc1ccfa]
/usr/bin/mongod(_ZN5mongo13deleteObjectsERKNS_10StringDataENS_7BSONObjEbbb+0x16e) [0xc1bcce]
/usr/bin/mongod(_ZN5mongo10TTLMonitor10doTTLForDBERKSs+0xd72) [0xf3f772]
/usr/bin/mongod(_ZN5mongo10TTLMonitor3runEv+0x4a1) [0xf41781]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0xd2) [0x1184572]
/usr/bin/mongod() [0x1241429]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f880b19db50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f880a5407bd]
Dec 21 21:27:21 localhost mongod.27017[2702]: [TTLMonitor] 
 
***aborting after fassert() failure
 
 
Dec 21 21:27:21 localhost mongod.27017[2702]: [TTLMonitor] Got signal: 6 (Aborted).
Backtrace:0x11fca91 0x11fbe6e 0x7f880a4961e0 0x7f880a496165 0x7f880a4993e0 0x118142a 0xefd5ed 0xefd754 0xf3c972 0x8b8566 0xc1ccfa 0xc1bcce 0xf3f772 0xf41781 0x1184572 0x1241429 0x7f880b19db50 0x7f880a5407bd 
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11fca91]
/usr/bin/mongod() [0x11fbe6e]
/lib/x86_64-linux-gnu/libc.so.6(+0x321e0) [0x7f880a4961e0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f880a496165]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180) [0x7f880a4993e0]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x118142a]
/usr/bin/mongod() [0xefd5ed]
/usr/bin/mongod(_ZNK5mongo13ExtentManager13getNextRecordERKNS_7DiskLocE+0x24) [0xefd754]
/usr/bin/mongod(_ZN5mongo17RecordStoreV1Base12deleteRecordERKNS_7DiskLocE+0xb2) [0xf3c972]
/usr/bin/mongod(_ZN5mongo10Collection14deleteDocumentERKNS_7DiskLocEbbPNS_7BSONObjE+0x636) [0x8b8566]
/usr/bin/mongod(_ZN5mongo14DeleteExecutor7executeEv+0x9da) [0xc1ccfa]
/usr/bin/mongod(_ZN5mongo13deleteObjectsERKNS_10StringDataENS_7BSONObjEbbb+0x16e) [0xc1bcce]
/usr/bin/mongod(_ZN5mongo10TTLMonitor10doTTLForDBERKSs+0xd72) [0xf3f772]
/usr/bin/mongod(_ZN5mongo10TTLMonitor3runEv+0x4a1) [0xf41781]
/usr/bin/mongod(_ZN5mongo13BackgroundJob7jobBodyEv+0xd2) [0x1184572]
/usr/bin/mongod() [0x1241429]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50) [0x7f880b19db50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f880a5407bd]

The server won't restart (it hits the same assertion 10-15 seconds after startup).

Is this data/index corruption? This is a production system, so any help would be appreciated...
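
Since the fassert() fires in the [TTLMonitor] thread while it deletes expired documents from prod_front.sessions, I'm considering restarting the node with the TTL monitor disabled, just to get it up long enough to collect logs. A sketch of what I have in mind (the config file path is from our setup, and I haven't verified this avoids the crash):

# start mongod with the background TTL monitor turned off
mongod --config /etc/mongod.conf --setParameter ttlMonitorEnabled=false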



 Comments   
Comment by Ramon Fernandez Marina [ 23/Jan/15 ]

bhuisgen, are you still running into this issue or have you been able to re-sync this node from a healthy primary? If this is still an issue for you, can you please upload full logs for the affected node(s) when it happens?

Thanks,
Ramón.

Comment by Daniel Pasette (Inactive) [ 23/Dec/14 ]

Many deployments run in AWS using EBS. EBS should not ordinarily be a correctness problem as long as its performance characteristics are fine for your use case.

Comment by Boris HUISGEN [ 23/Dec/14 ]

OK, it's probably a problem with the EBS disks (m3.medium EC2 instances with a dedicated EBS volume for MongoDB). Are EBS disks really advisable for a stable environment? The easiest option for me would be to move to local SSD disks...

Comment by Ramon Fernandez Marina [ 22/Dec/14 ]

I forgot to add: there's an open ticket (SERVER-15759) to improve the behavior of --repair in these circumstances.

Also, the lines above the "Fatal Assertion 16967" error message typically contain further information. In fact, it would be good if you could upload the logs from this primary covering everything from startup through the fassert(), to try to rule out other issues.
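
If mongod is logging through syslog, as your snippet suggests, something along these lines should extract the relevant window (the log file name assumes a default Debian rsyslog setup; adjust as needed):

# pull all lines for this mongod instance out of the system log
grep 'mongod.27017' /var/log/syslog > mongod-fassert.log

You can then attach mongod-fassert.log to this ticket.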

Comment by Ramon Fernandez Marina [ 22/Dec/14 ]

Hi bhuisgen, this assertion may indeed be triggered by data corruption, often caused by a faulty disk. If you run mongod --repair, chances are you'll see the same assertion, but since you have a replica set I'd recommend you resync from a healthy node. Please check the health of your disks as well.
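
For reference, the usual resync procedure is to stop the node, move its data files aside, and restart it with an empty dbpath so it performs an initial sync from another member. Roughly, assuming the default paths and service name from the Debian packages (adjust for your install):

# stop the affected node and set the (possibly corrupt) data files aside
sudo service mongod stop
sudo mv /var/lib/mongodb /var/lib/mongodb.bad

# recreate an empty dbpath owned by the mongodb user
sudo mkdir /var/lib/mongodb
sudo chown mongodb:mongodb /var/lib/mongodb

# restarting with an empty dbpath triggers an initial sync from the replica set
sudo service mongod start

You can follow the progress of the initial sync with rs.status() in the mongo shell on another member.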
