[SERVER-11808] Primary node keeps crashing - second time in one week Created: 21/Nov/13  Updated: 10/Dec/14  Resolved: 18/Mar/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Nic Cottrell (Personal) Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Centos6


Attachments: Text File lsMongo.txt     HTML File messages     Text File mongo-crash-20131124.log     Text File mongod-nov21-crash.log     Text File mongod-nov21-restart.log    
Operating System: ALL
Participants:

 Description   

Server crashes updating an index and can't restart. I tried to restart the server manually (see restart.log) but not sure how to proceed. This was the primary for the default shared in the cluster. The secondary has taken over for now...



 Comments   
Comment by Nic Cottrell (Personal) [ 18/Mar/14 ]

Thanks Stephen. Yeah, I've managed to avoid these problems. I think a large part of the problem was hardware problems (bad hardware RAID) causing underlying corruption. Since moving machines I have not had further errors like this.

Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ]

Hi Nic,

I noticed this issue is still open, but given the length of time that has passed I expect you must have found a solution for this.

In regards to keys too long to index, I would note there is a behaviour change in MongoDB 2.6 that provides stronger enforcement on this limit (raising exceptions for limit violations, rather than skipping adding the document to the affected index): http://docs.mongodb.org/master/release-notes/2.6-compatibility/#enforce-index-key-length-limit/.

I'm going to resolve this issue as Incomplete, but please feel free to comment or reopen if there is additional information to investigate or some feedback on the resolution.

Thanks,
Stephen

Comment by Nic Cottrell (Personal) [ 25/Nov/13 ]

Here's the mongo data dir and the /var/log/messages from today. The node went down at about 10am local time (according to MMS) but I think that might have been me trying to restart...

Comment by Eliot Horowitz (Inactive) [ 25/Nov/13 ]

Can you send /var/log/messages and also ls -la of the data db directory?

Comment by Nic Cottrell (Personal) [ 25/Nov/13 ]

It's hard to say - there's quite a variation in data so about 90% of the calls will result in inserts I'd say. I seemed to have missed some checks (the production version of the site still seems to upsert longer texts) and the primary crashed again overnight.

This time the journal dir was empty but the mongod.lock was still there. Removing the lock and starting mongo gave this error (at least not a negative value):

Mon Nov 25 09:25:47.678 [conn25]  authenticate db: admin { authenticate: 1, user: "mmsagent", nonce: "1af0a9549a6ecec6", key: "1541b9447f305decf775767f54ffbb1e" }
Mon Nov 25 09:25:57.381 [conn10] Assertion: 10334:BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO
0xde05e1 0xda1d1b 0xda225c 0x6ec95f 0x8c4ccc 0xa8211f 0xa83e6a 0xa7ed51 0xa843ae 0xa87218 0x9fa499 0x9fb9b3 0x6e83a8 0xdccbae 0x7fab3349d851 0x7fab3284394d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xda1d1b]
 /usr/bin/mongod() [0xda225c]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ec95f]
 /usr/bin/mongod(_ZN5mongo11BasicCursor7currentEv+0x5c) [0x8c4ccc]
 /usr/bin/mongod(_ZNK5mongo21ResponseBuildStrategy7currentEbPNS_13ResultDetailsE+0x11f) [0xa8211f]
 /usr/bin/mongod(_ZN5mongo20OrderedBuildStrategy11handleMatchEPNS_13ResultDetailsE+0xaa) [0xa83e6a]
 /usr/bin/mongod(_ZN5mongo20QueryResponseBuilder8addMatchEv+0xb1) [0xa7ed51]
 /usr/bin/mongod(_ZN5mongo23queryWithQueryOptimizerEiRKSsRKNS_7BSONObjERNS_5CurOpES4_S4_RKN5boost10shared_ptrINS_11ParsedQueryEEES4_RKNS_12ChunkVersionERNS7_10scoped_ptrINS_25PageFaultRetryableSectionEEERNSG_INS_19NoPageFaultsAllowedEEERNS_7MessageE+0x3be) [0xa843ae]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x1ac8) [0xa87218]
 /usr/bin/mongod() [0x9fa499]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x383) [0x9fb9b3]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x98) [0x6e83a8]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdccbae]
 /lib64/libpthread.so.0(+0x7851) [0x7fab3349d851]
 /lib64/libc.so.6(clone+0x6d) [0x7fab3284394d]

Comment by Eliot Horowitz (Inactive) [ 25/Nov/13 ]

The last line is not a concern (though probably means started mongo with a different user at some point).

The errors in the java driver on upsert, are those likely new documents are documents being modified?

Comment by Nic Cottrell (Personal) [ 24/Nov/13 ]

It seems that the server crashes unexpectedly and then the data is corrupted...

Is this line a concern?

Sat Nov 23 15:07:45.071 [initandlisten] couldn't unlink socket file /tmp/mongodb-27017.sockerrno:1 Operation not permitted skipping

Comment by Nic Cottrell (Personal) [ 24/Nov/13 ]

Primary node on primary shard crashed again and is down. The secondary node has taken over. If it helps, I get a lot of:

{{{

Sat Nov 23 12:35:39.120 [conn747] jerome5.TranslationQueue ERROR: key too large len:1211 max:1024 1211 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.129 [conn747] jerome5.TranslationQueue ERROR: key too large len:1634 max:1024 1634 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.133 [conn747] jerome5.TranslationQueue ERROR: key too large len:1535 max:1024 1535 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.143 [conn747] jerome5.TranslationQueue ERROR: key too large len:1914 max:1024 1914 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.147 [conn747] jerome5.TranslationQueue ERROR: key too large len:1323 max:1024 1323 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.157 [conn747] jerome5.TranslationQueue ERROR: key too large len:1687 max:1024 1687 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
Sat Nov 23 12:35:39.161 [conn747] jerome5.TranslationQueue ERROR: key too large len:1511 max:1024 1511 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
}}}

Which are triggered by an upsert from the Java mongo driver. I've patched to the code to skip upserts on this collection when it would generate a key that is more than 1024 bytes. Will let you know if the crash happens again.... Right now will go run a DB repair again.

Generated at Thu Feb 08 03:26:48 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.