[SERVER-17147] sharded cluster replSet mongod crash Created: 02/Feb/15  Updated: 20/Mar/15  Resolved: 20/Mar/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance
Affects Version/s: 2.6.7
Fix Version/s: None

Type: Bug
Priority: Critical - P2
Reporter: SRR
Assignee: Unassigned
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

Sharded cluster with replica sets (2 members, 1 arbiter). After upgrading from 2.4.5 to 2.6.6 and/or 2.6.7, the mongod processes crash at random times. Sometimes both members of a replica set crash one after the other.

2015-01-26T09:21:10.186-0500 [rsMgr] replSet info electSelf 0
2015-01-26T09:21:10.494-0500 [rsHealthPoll] replSet member machine4-lx:25920 is up
2015-01-26T09:21:10.494-0500 [rsHealthPoll] replSet member machine4-lx:25920 is now in state SECONDARY
2015-01-26T09:21:10.647-0500 [rsMgr] replSet PRIMARY
2015-01-26T09:21:10.650-0500 [rsMgr] CMD: drop ssp.tmp.mr.testresults_255_inc
2015-01-26T09:21:10.660-0500 [rsMgr] ERROR: About to fassert -  numIndexesTotal(): 0 numSystemIndexesEntries: 1 _entries.size(): 0 indexNamesToDrop: 1 haveIdIndex: 0
2015-01-26T09:21:10.660-0500 [rsMgr] ssp Fatal Assertion 17328
2015-01-26T09:21:10.679-0500 [rsMgr] ssp 0x11fd1b1 0x119efa9 0x1181add 0x8e5d0c 0x8c74cb 0xa3358e 0xa2949a 0xa2b611 0xa2cd36 0xd61654 0xba21a2 0xba3780 0xba4d06 0x7ebb1a 0x7b54ba 0xb95d66 0x7b87ca 0x7b8f22 0x7c5a40 0x7d1e2f 
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11fd1b1]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo10logContextEPKc+0x159) [0x119efa9]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo13fassertFailedEi+0xcd) [0x1181add]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12IndexCatalog14dropAllIndexesEb+0xf4c) [0x8e5d0c]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo8Database14dropCollectionERKNS_10StringDataE+0x37b) [0x8c74cb]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo7CmdDrop3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3ae) [0xa3358e]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2949a]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x1721) [0xa2b611]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c6) [0xa2cd36]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2474) [0xd61654]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod() [0xba21a2]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x580) [0xba3780]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0xb6) [0xba4d06]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0xba) [0x7ebb1a]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xea) [0x7b54ba]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBDirectClient5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x56) [0xb95d66]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo17DBClientInterface5findNERSt6vectorINS_7BSONObjESaIS2_EERKSsNS_5QueryEiiPKS2_i+0x9a) [0x7b87ca]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo17DBClientInterface7findOneERKSsRKNS_5QueryEPKNS_7BSONObjEi+0x72) [0x7b8f22]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo20DBClientWithCommands10runCommandERKSsRKNS_7BSONObjERS3_i+0xb0) [0x7c5a40]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo20DBClientWithCommands14dropCollectionERKSsPNS_7BSONObjE+0x15f) [0x7d1e2f]
2015-01-26T09:21:10.679-0500 [rsMgr] 
 
***aborting after fassert() failure
 
 
2015-01-26T09:21:10.695-0500 [rsMgr] SEVERE: Got signal: 6 (Aborted).
Backtrace:0x11fd1b1 0x11fc58e 0x3aee432920 0x3aee4328a5 0x3aee434085 0x1181b4a 0x8e5d0c 0x8c74cb 0xa3358e 0xa2949a 0xa2b611 0xa2cd36 0xd61654 0xba21a2 0xba3780 0xba4d06 0x7ebb1a 0x7b54ba 0xb95d66 0x7b87ca 
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x11fd1b1]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod() [0x11fc58e]
 /lib64/libc.so.6() [0x3aee432920]
 /lib64/libc.so.6(gsignal+0x35) [0x3aee4328a5]
 /lib64/libc.so.6(abort+0x175) [0x3aee434085]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo13fassertFailedEi+0x13a) [0x1181b4a]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12IndexCatalog14dropAllIndexesEb+0xf4c) [0x8e5d0c]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo8Database14dropCollectionERKNS_10StringDataE+0x37b) [0x8c74cb]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo7CmdDrop3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3ae) [0xa3358e]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0xa2949a]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x1721) [0xa2b611]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6c6) [0xa2cd36]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo11newRunQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x2474) [0xd61654]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod() [0xba21a2]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x580) [0xba3780]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBDirectClient4callERNS_7MessageES2_bPSs+0xb6) [0xba4d06]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBClientCursor4initEv+0xba) [0x7ebb1a]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo12DBClientBase5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0xea) [0x7b54ba]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo14DBDirectClient5queryERKSsNS_5QueryEiiPKNS_7BSONObjEii+0x56) [0xb95d66]
 /ssp/data1/mongodb/mongodb-linux-x86_64-2.6.7/bin/mongod(_ZN5mongo17DBClientInterface5findNERSt6vectorINS_7BSONObjESaIS2_EERKSsNS_5QueryEiiPKS2_i+0x9a) [0x7b87ca]



 Comments   
Comment by Sam Kleinman (Inactive) [ 20/Mar/15 ]

Sorry for not getting back to you sooner, and sorry that this is frustrating. While the 2.4->2.6 upgrade process can be frustrating, the additional restrictions in 2.6 do make deployments more stable and reliable.

It looks like the upgradeCheckAllDBs method will detect cases where you have collections that don't have indexes on the _id field. If these collections exist on secondaries, you can restart these instances as standalone (i.e. non-replica set) with the older version, and create the required indexes or drop these collections. At that point you should be able to upgrade the instances and start them as members of the replica set. See the [documentation|http://docs.mongodb.org/manual/release-notes/2.6-upgrade/] for complete upgrade instructions.
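A minimal sketch of a manual version of that check, written in legacy mongo shell style and not taken from the upgrade documentation. It assumes the node has been restarted as a standalone on the older version. `collectionsMissingIdIndex` is a hypothetical helper name, and treating "has an _id index" as "has an index whose key pattern is exactly { _id: 1 }" is an assumption of the sketch.

```javascript
// Hypothetical helper (not part of MongoDB): names of collections in `db`
// that have no index whose key pattern is exactly { _id: 1 }.
function collectionsMissingIdIndex(db) {
  return db.getCollectionNames().filter(function (name) {
    // Skip system collections such as system.indexes.
    if (name.indexOf("system.") === 0) {
      return false;
    }
    var specs = db.getCollection(name).getIndexes();
    var hasIdIndex = specs.some(function (spec) {
      var fields = Object.keys(spec.key);
      return fields.length === 1 && fields[0] === "_id" && spec.key._id === 1;
    });
    return !hasIdIndex;
  });
}
```

In the shell this could be called as collectionsMissingIdIndex(db.getSiblingDB("ssp")). Each collection it reports is a candidate for the "create the required index or drop it" step described above.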

Sorry again for the frustration, and hope you can successfully upgrade.

Regards,
sam

Comment by SRR [ 09/Feb/15 ]

So if anyone else runs into this problem: you must kill the mongod process that has the stale collection (if it's not dead already) and then restart it with the same settings/conf but with the mongod executable from MongoDB 2.4.10.
You can then drop the offending collection successfully. (I may also have re-created and dropped it through the mongos; I tried many things over time until operations finally synced from the primary's oplog without the node crashing.)
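The drop step could look roughly like the sketch below, in legacy mongo shell style, run against the node started with the older binary. `dropStaleMapReduceTemp` is a hypothetical helper. The `tmp.mr.` prefix test is an assumption drawn from the collection name in the log, and dropping every match is only reasonable when no mapreduce job is running on that node.

```javascript
// Hypothetical helper: drop leftover mapreduce temp collections in `db`
// and return the names that were actually dropped.
function dropStaleMapReduceTemp(db) {
  var dropped = [];
  db.getCollectionNames().forEach(function (name) {
    if (name.indexOf("tmp.mr.") === 0) {
      // In the shell, drop() returns true when the collection was removed.
      if (db.getCollection(name).drop()) {
        dropped.push(name);
      }
    }
  });
  return dropped;
}
```

For the case in this ticket the call would be dropStaleMapReduceTemp(db.getSiblingDB("ssp")).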

The issue I mentioned in the previous comment happened again, and I had to run rs.stepDown(10) on all the replica set primaries, so that a secondary became primary, before it started working again. Ridiculous.

Comment by SRR [ 03/Feb/15 ]

So I tried to drop that tmp collection, but it complained that the node had to be the master. I then tried to step down the master, but of course the secondary died as shown in the initial log.

Frustratingly, the database then stopped working for our app with the error "failed: MR post processing failed: { errmsg: "exception: could not initialize cursor across all shards because : socket exception [SEND_ERROR]", even though only that secondary was dead. I began restarting the app, mongod, that secondary and mongos, but nothing worked; I kept getting the same error. After 15 minutes it started working again on its own, so maybe some database cache was the problem.

Any ideas?

Comment by SRR [ 03/Feb/15 ]

Okay, actually I got a warning when I connected that I haven't seen before, so this is probably my issue:

Server has startup warnings:
2015-02-02T17:01:46.480-0500 [initandlisten] WARNING: the collection 'ssp.tmp.mr.testresults_255_inc' lacks a unique index on _id. This index is needed for replication to function properly
2015-02-02T17:01:46.480-0500 [initandlisten] To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
rs0:SECONDARY> rs.slaveOk()
rs0:SECONDARY> db.system.indexes.getIndexSpecs()
[ ]
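When several nodes need checking, the affected namespace can be pulled out of that startup warning mechanically. The sketch below is plain JavaScript. `namespaceFromIdIndexWarning` is a hypothetical name, and the pattern is matched only against the warning wording shown above, so other server versions may phrase it differently.

```javascript
// Extract the namespace from a "lacks a unique index on _id" startup warning.
// Returns { db, collection } or null when the line is not that warning.
function namespaceFromIdIndexWarning(line) {
  var match = /WARNING: the collection '([^']+)' lacks a unique index on _id/.exec(line);
  if (!match) {
    return null;
  }
  var full = match[1];
  // The database name is everything before the first dot.
  var dot = full.indexOf(".");
  return { db: full.substring(0, dot), collection: full.substring(dot + 1) };
}
```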

Comment by Ramon Fernandez Marina [ 03/Feb/15 ]

aerospace, apologies if I wasn't clear – I meant to ask you to run

db.system.indexes.getIndexSpecs()

on all the data-bearing nodes where you see the assertion (not against mongos). Can you please connect to each mongod and run the above command? Note that on secondaries you may need to run:

rs.slaveOk()

first.

If any of the mongod servers produces non-empty output then we'll know we're dealing with SERVER-14999 (which has a verified workaround).
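A rough sketch of running that check across several nodes in one pass, in legacy mongo shell style. `nodesWithNonEmptyIndexSpecs` and its `openConnection` parameter are hypothetical; in the shell, `openConnection` could be function (h) { return new Mongo(h); }. The host list is whatever the operator supplies.

```javascript
// Hypothetical helper: return the hosts whose system.indexes collection
// reports a non-empty getIndexSpecs() result for database `dbName`.
// `openConnection` (an assumption of this sketch) maps "host:port" to a
// shell connection object.
function nodesWithNonEmptyIndexSpecs(hosts, dbName, openConnection) {
  return hosts.filter(function (host) {
    var conn = openConnection(host);
    // Allow reads on secondaries, as rs.slaveOk() does for the current connection.
    conn.setSlaveOk();
    var specs = conn.getDB(dbName).getCollection("system.indexes").getIndexSpecs();
    return specs.length > 0;
  });
}
```

An empty result corresponds to the "[ ]" output reported elsewhere in this ticket.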

Thanks,
Ramón.

Comment by SRR [ 02/Feb/15 ]

mongos> db.system.indexes.getIndexSpecs()
[ ]

Comment by Andy Schwerin [ 02/Feb/15 ]

Unfortunately, the stack trace is so deep that it is truncated, so it's hard to be 100% certain, but the log suggests that the crash happens while deleting a temporary collection created by a mapreduce that was running on the former primary. The node whose log snippet is presented is just becoming primary at the top, and goes to drop a collection with the telltale .tmp.mr. in its name. The stack trace, though truncated, indicates that the drop-collection call was initiated by mongod itself, not by a client, which also points to temp-collection clean-up.
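The pattern described here, a drop logged by a thread and then a fatal assertion on the same thread, can be searched for in other nodes' logs. The sketch below is plain JavaScript and assumes the 2.6 log line layout seen in this ticket ("timestamp [thread] message"); `dropsFollowedByFassert` is a hypothetical name.

```javascript
// Find "CMD: drop <namespace>" lines that are followed, on the same thread,
// by a "Fatal Assertion <code>" line.
function dropsFollowedByFassert(logLines) {
  var lastDrop = {}; // thread name -> namespace of that thread's most recent drop
  var hits = [];
  logLines.forEach(function (line) {
    var parts = /^\S+ \[([^\]]+)\] (.*)$/.exec(line);
    if (!parts) {
      return; // stack frames and blank lines do not match the layout
    }
    var thread = parts[1];
    var message = parts[2];
    var drop = /^CMD: drop (\S+)/.exec(message);
    if (drop) {
      lastDrop[thread] = drop[1];
      return;
    }
    var fatal = /Fatal Assertion (\d+)/.exec(message);
    if (fatal && lastDrop[thread]) {
      hits.push({ thread: thread, namespace: lastDrop[thread], code: Number(fatal[1]) });
      delete lastDrop[thread];
    }
  });
  return hits;
}
```

Fed the log excerpt from the description, this reports the rsMgr thread, the namespace ssp.tmp.mr.testresults_255_inc, and assertion code 17328.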

Comment by Ramon Fernandez Marina [ 02/Feb/15 ]

aerospace, this might be a duplicate of SERVER-14999. Can you please post the output of

db.system.indexes.getIndexSpecs()

?
