[SERVER-18751] "Fatal Assertion" Crashed 1 Member In 3-Member Replica Set Shard Environment Created: 30/May/15  Updated: 14/Apr/16  Resolved: 19/Jun/15

Status: Closed
Project: Core Server
Component/s: Admin, Replication
Affects Version/s: 2.6.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ovidiu Tatar Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Steps To Reproduce:

I tried to fix the issue by using a different repair path and then restarting the replica member so it would resynchronise; in that case I get the following message (a sketch of the repair invocation follows the trace below):

2015-05-29T23:20:38.122-0700 [rsSync] crawler.urls 0x1205431 0x11a7229 0x118b53e 0x793b27 0x79f389 0x79f4b1 0xc36869 0xc5a485 0xc53cd6 0xc5f7a6 0xc4e447 0xe5ca76 0xec5949 0xea1e04 0xea3177 0xec3844 0xec38f0 0xec3c0a 0x1249dc9 0x33188079d1 
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x1205431]
 mongod(_ZN5mongo10logContextEPKc+0x159) [0x11a7229]
 mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x17e) [0x118b53e]
 mongod(_ZN5mongo11mutablebson7Element8addChildES1_b+0x277) [0x793b27]
 mongod(_ZN5mongo11mutablebson7Element8pushBackES1_+0x19) [0x79f389]
 mongod(_ZN5mongo11mutablebson7Element13appendElementERKNS_11BSONElementE+0x41) [0x79f4b1]
 mongod(_ZNK5mongo21ModifierObjectReplace5applyEv+0x319) [0xc36869]
 mongod(_ZN5mongo12UpdateDriver6updateERKNS_10StringDataEPNS_11mutablebson8DocumentEPNS_7BSONObjEPNS_11FieldRefSetE+0x285) [0xc5a485]
 mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugEPNS_12UpdateDriverEPNS_14CanonicalQueryE+0xc36) [0xc53cd6]
 mongod(_ZN5mongo14UpdateExecutor7executeEv+0x66) [0xc5f7a6]
 mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugE+0x27) [0xc4e447]
 mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xe16) [0xe5ca76]
 mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x749) [0xec5949]
 mongod(_ZN5mongo11ReplSetImpl12_initialSyncEv+0x804) [0xea1e04]
 mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x37) [0xea3177]
 mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x124) [0xec3844]
 mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x30) [0xec38f0]
 mongod(_ZN5mongo15startSyncThreadEv+0xaa) [0xec3c0a]
 mongod() [0x1249dc9]
 /lib64/libpthread.so.0() [0x33188079d1]

Finally, after allocating part of the datafile, the recovery process starts over from the beginning, again and again.
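For reference, a repair attempt into a separate location would look roughly like this (the repair directory name here is illustrative; --repair and --repairpath are standard mongod options in the 2.6 series):

 # run a standalone repair, writing repaired files into a scratch directory
 mongod --dbpath /home/shard2 --repair --repairpath /home/shard2-repair
 # if the repair completes, restart the member with its usual options and let it rejoin the set
 mongod --shardsvr --nojournal --bind_ip 192.168.10.11 --port 20002 --dbpath /home/shard2 --replSet crawler2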

Participants:

 Description   

I'm facing the following issue after MongoDB did not shut down cleanly. I tried running with the --repair parameter, but the recovery process exits with the same error message.

mongod --shardsvr --profile 0 --nojournal --bind_ip 192.168.10.11 --port 20002 --dbpath /home/shard2 --replSet crawler2
2015-05-29T22:53:42.682-0700 [initandlisten] MongoDB starting : pid=29125 port=20002 dbpath=/home/shard2 64-bit host=Crawler1-web1
2015-05-29T22:53:42.683-0700 [initandlisten] 
2015-05-29T22:53:42.683-0700 [initandlisten] ** WARNING: You are running on a NUMA machine.
2015-05-29T22:53:42.683-0700 [initandlisten] **          We suggest launching mongod like this to avoid performance problems:
2015-05-29T22:53:42.683-0700 [initandlisten] **              numactl --interleave=all mongod [other options]
2015-05-29T22:53:42.683-0700 [initandlisten] 
2015-05-29T22:53:42.683-0700 [initandlisten] db version v2.6.8
2015-05-29T22:53:42.683-0700 [initandlisten] git version: 3abc04d6d4f71de00b57378e3277def8fd7a6700
2015-05-29T22:53:42.683-0700 [initandlisten] build info: Linux build5.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2015-05-29T22:53:42.683-0700 [initandlisten] allocator: tcmalloc
2015-05-29T22:53:42.683-0700 [initandlisten] options: { net: { bindIp: "192.168.10.11", port: 20002 }, operationProfiling: { mode: "off" }, replication: { replSet: "crawler2" }, sharding: { clusterRole: "shardsvr" }, storage: { dbPath: "/home/shard2", journal: { enabled: false } } }
2015-05-29T22:53:42.734-0700 [initandlisten] waiting for connections on port 20002
2015-05-29T22:53:42.738-0700 [rsStart] replSet I am 192.168.10.11:20002
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replset info 192.168.10.13:20002 thinks that we are down
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replSet member 192.168.10.13:20002 is up
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replSet member 192.168.10.13:20002 is now in state SECONDARY
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replset info 192.168.10.12:20002 thinks that we are down
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replSet member 192.168.10.12:20002 is up
2015-05-29T22:53:42.740-0700 [rsHealthPoll] replSet member 192.168.10.12:20002 is now in state PRIMARY
2015-05-29T22:53:42.751-0700 [rsStart] local.oplog.rs Fatal Assertion 16968
2015-05-29T22:53:42.764-0700 [rsStart] local.oplog.rs 0x1205431 0x11a7229 0x1189d5d 0xf0479d 0xf048e4 0xf42b53 0xf42fa5 0xaa2a4c 0xd8237f 0xa4b65e 0xe7942f 0xe7bdb3 0xe85406 0x1249dc9 0x33188079d1 0x33184e88fd 
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x1205431]
 mongod(_ZN5mongo10logContextEPKc+0x159) [0x11a7229]
 mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x1189d5d]
 mongod() [0xf0479d]
 mongod(_ZNK5mongo13ExtentManager13getPrevRecordERKNS_7DiskLocE+0x24) [0xf048e4]
 mongod(_ZN5mongo14CappedIterator13getNextCappedEPKNS_16NamespaceDetailsEPKNS_13ExtentManagerERKNS_7DiskLocENS_20CollectionScanParams9DirectionE+0x63) [0xf42b53]
 mongod(_ZN5mongo14CappedIterator7getNextEv+0x115) [0xf42fa5]
 mongod(_ZN5mongo14CollectionScan4workEPm+0xfc) [0xaa2a4c]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_7DiskLocE+0xef) [0xd8237f]
 mongod(_ZN5mongo7Helpers7getLastEPKcRNS_7BSONObjE+0x9e) [0xa4b65e]
 mongod(_ZN5mongo11ReplSetImpl21loadLastOpTimeWrittenEb+0x7f) [0xe7942f]
 mongod(_ZN5mongo11ReplSetImpl3_goEv+0x543) [0xe7bdb3]
 mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x56) [0xe85406]
 mongod() [0x1249dc9]
 /lib64/libpthread.so.0() [0x33188079d1]
 /lib64/libc.so.6(clone+0x6d) [0x33184e88fd]
2015-05-29T22:53:42.764-0700 [rsStart] 
 
***aborting after fassert() failure
 
 
2015-05-29T22:53:42.773-0700 [rsStart] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x1205431 0x120480e 0x33184326a0 0x3318432625 0x3318433e05 0x1189dca 0xf0479d 0xf048e4 0xf42b53 0xf42fa5 0xaa2a4c 0xd8237f 0xa4b65e 0xe7942f 0xe7bdb3 0xe85406 0x1249dc9 0x33188079d1 0x33184e88fd 
 mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x1205431]
 mongod() [0x120480e]
 /lib64/libc.so.6() [0x33184326a0]
 /lib64/libc.so.6(gsignal+0x35) [0x3318432625]
 /lib64/libc.so.6(abort+0x175) [0x3318433e05]
 mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x1189dca]
 mongod() [0xf0479d]
 mongod(_ZNK5mongo13ExtentManager13getPrevRecordERKNS_7DiskLocE+0x24) [0xf048e4]
 mongod(_ZN5mongo14CappedIterator13getNextCappedEPKNS_16NamespaceDetailsEPKNS_13ExtentManagerERKNS_7DiskLocENS_20CollectionScanParams9DirectionE+0x63) [0xf42b53]
 mongod(_ZN5mongo14CappedIterator7getNextEv+0x115) [0xf42fa5]
 mongod(_ZN5mongo14CollectionScan4workEPm+0xfc) [0xaa2a4c]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_7DiskLocE+0xef) [0xd8237f]
 mongod(_ZN5mongo7Helpers7getLastEPKcRNS_7BSONObjE+0x9e) [0xa4b65e]
 mongod(_ZN5mongo11ReplSetImpl21loadLastOpTimeWrittenEb+0x7f) [0xe7942f]
 mongod(_ZN5mongo11ReplSetImpl3_goEv+0x543) [0xe7bdb3]
 mongod(_ZN5mongo13startReplSetsEPNS_14ReplSetCmdlineE+0x56) [0xe85406]
 mongod() [0x1249dc9]
 /lib64/libpthread.so.0() [0x33188079d1]
 /lib64/libc.so.6(clone+0x6d) [0x33184e88fd]
 



 Comments   
Comment by Ramon Fernandez Marina [ 19/Jun/15 ]

ovidiu.tatar@3ziele.de, it is highly unlikely that clock synchronization was the source of the issue here, and I'd still encourage you to check for file corruption and storage health on your deployment to avoid future issues in this area.
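For example, one way to check a suspect collection for corruption is a full validate run against the affected member (connection details taken from the logs above; this is just a sketch):

 mongo 192.168.10.11:20002/crawler --eval 'printjson(db.urls.validate(true))'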

I'm going to close this for now, let us know if the issue reappears.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 18/Jun/15 ]

ovidiu.tatar@3ziele.de, glad to hear you worked around the issue. I'm investigating whether clock skew could be responsible for this and whether there's room for improvement if it is.

Regards,
Ramón.

Comment by Ovidiu Tatar [ 04/Jun/15 ]

Solved the issue by synchronising the time on all instances via $ sudo ntpdate pool.ntp.org.
I would really appreciate it if mongod compared the clocks of replica set members before starting any production activity.
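For anyone hitting the same problem, a minimal sketch of checking and then fixing the offset on each member (assuming ntpdate is available and a RHEL/CentOS-style init, matching the el6 build in the logs):

 # show the current offset without touching the clock
 sudo ntpdate -q pool.ntp.org
 # step the clock once, then keep it in sync going forward
 sudo ntpdate pool.ntp.org
 sudo service ntpd start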

Comment by Ovidiu Tatar [ 30/May/15 ]

Unfortunately, after recovering some data I got the following error message.

For this command I used a new --dbpath, /home/shard1-tmp/, but the sync failed:

2015-05-30T10:25:26.598-0700 [FileAllocator] done allocating datafile /home/shard1-tmp/test.1, size: 2047MB, took 0.046 secs
2015-05-30T10:25:26.600-0700 [rsSync] build index on: test.oplog.rs properties: { v: 1, key: { _id: 1 }, name: "_id_", ns: "test.oplog.rs" }
2015-05-30T10:25:26.600-0700 [rsSync] added index to empty collection
2015-05-30T10:25:26.600-0700 [rsSync] replSet initial sync cloning db: admin
2015-05-30T10:25:26.601-0700 [rsSync] replSet initial sync data copy, starting syncup
2015-05-30T10:25:26.601-0700 [rsSync] crawler.urls Assertion failure e.ok() src/mongo/bson/mutable/document.cpp 1919
2015-05-30T10:25:26.617-0700 [rsSync] crawler.urls 0x1205431 0x11a7229 0x118b53e 0x793b27 0x79f389 0x79f4b1 0xc36869 0xc5a485 0xc53cd6 0xc5f7a6 0xc4e447 0xe5ca76 0xec5949 0xea1e04 0xea3177 0xec3844 0xec38f0 0xec3c0a 0x1249dc9 0x33188079d1
mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x1205431]
mongod(_ZN5mongo10logContextEPKc+0x159) [0x11a7229]
mongod(_ZN5mongo12verifyFailedEPKcS1_j+0x17e) [0x118b53e]
mongod(_ZN5mongo11mutablebson7Element8addChildES1_b+0x277) [0x793b27]
mongod(_ZN5mongo11mutablebson7Element8pushBackES1_+0x19) [0x79f389]
mongod(_ZN5mongo11mutablebson7Element13appendElementERKNS_11BSONElementE+0x41) [0x79f4b1]
mongod(_ZNK5mongo21ModifierObjectReplace5applyEv+0x319) [0xc36869]
mongod(_ZN5mongo12UpdateDriver6updateERKNS_10StringDataEPNS_11mutablebson8DocumentEPNS_7BSONObjEPNS_11FieldRefSetE+0x285) [0xc5a485]
mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugEPNS_12UpdateDriverEPNS_14CanonicalQueryE+0xc36) [0xc53cd6]
mongod(_ZN5mongo14UpdateExecutor7executeEv+0x66) [0xc5f7a6]
mongod(_ZN5mongo6updateERKNS_13UpdateRequestEPNS_7OpDebugE+0x27) [0xc4e447]
mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xe16) [0xe5ca76]
mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x749) [0xec5949]
mongod(_ZN5mongo11ReplSetImpl12_initialSyncEv+0x804) [0xea1e04]
mongod(_ZN5mongo11ReplSetImpl17syncDoInitialSyncEv+0x37) [0xea3177]
mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0x124) [0xec3844]
mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x30) [0xec38f0]
mongod(_ZN5mongo15startSyncThreadEv+0xaa) [0xec3c0a]
mongod() [0x1249dc9]
/lib64/libpthread.so.0() [0x33188079d1]
2015-05-30T10:25:26.617-0700 [rsSync] replSet initial sync exception: 0 assertion src/mongo/bson/mutable/document.cpp:1919 9 attempts remaining

Any ideas how to fix this issue?

Comment by Ramon Fernandez Marina [ 30/May/15 ]

Looks like the unclean shutdown may have corrupted the data files for the local database. I would suggest you resync this member from the primary, which should solve your problem. Can you please try the resync procedure and report back?
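Roughly, the resync would look like this (paths and options taken from the report; stop the affected mongod first, and keep the old files around until the sync finishes, disk space permitting):

 # with the affected mongod stopped, move the old data files aside
 mv /home/shard2 /home/shard2.corrupt
 mkdir /home/shard2
 # restart with the original options; the member then performs an initial sync from the primary
 mongod --shardsvr --profile 0 --nojournal --bind_ip 192.168.10.11 --port 20002 --dbpath /home/shard2 --replSet crawler2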

Thanks,
Ramón.
