[SERVER-11808] Primary node keeps crashing - second time in one week Created: 21/Nov/13 Updated: 10/Dec/14 Resolved: 18/Mar/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.4.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Nic Cottrell (Personal) | Assignee: | Unassigned |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | CentOS 6 |
| Attachments: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
The server crashed while updating an index and can't restart. I tried to restart the server manually (see restart.log) but I'm not sure how to proceed. This was the primary for the default shard in the cluster. The secondary has taken over for now... |
| Comments |
| Comment by Nic Cottrell (Personal) [ 18/Mar/14 ] |
|
Thanks Stephen. Yeah, I've managed to avoid these problems. I think a large part of it was hardware trouble (bad hardware RAID) causing underlying corruption. Since moving machines I have not had further errors like this. |
| Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ] |
|
Hi Nic, I noticed this issue is still open, but given the length of time that has passed I expect you must have found a solution for this. Regarding keys too long to index, note that there is a behaviour change in MongoDB 2.6 that provides stronger enforcement of this limit (raising exceptions for limit violations rather than skipping adding the document to the affected index): http://docs.mongodb.org/master/release-notes/2.6-compatibility/#enforce-index-key-length-limit. I'm going to resolve this issue as Incomplete, but please feel free to comment or reopen if there is additional information to investigate or some feedback on the resolution. Thanks, |
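To make the behaviour difference concrete, here is a minimal sketch using the legacy MongoDB Java driver; the database, collection, and field names are hypothetical, chosen only for illustration:
{{{
import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.MongoException;

public class KeyLengthDemo {
    public static void main(String[] args) throws Exception {
        MongoClient client = new MongoClient("localhost", 27017);
        try {
            DB db = client.getDB("test");                  // hypothetical database
            DBCollection coll = db.getCollection("demo");  // hypothetical collection
            coll.createIndex(new BasicDBObject("tl", 1));  // single-field index on "tl"

            // Build a string whose index key exceeds the 1024-byte limit.
            StringBuilder big = new StringBuilder();
            for (int i = 0; i < 2000; i++) {
                big.append('x');
            }

            try {
                coll.insert(new BasicDBObject("tl", big.toString()));
                // 2.4 behaviour: the insert is accepted, but the document is
                // skipped in the index, so index-backed queries won't find it.
                System.out.println("insert accepted (2.4-style behaviour)");
            } catch (MongoException e) {
                // 2.6 behaviour: the server rejects the write outright.
                System.out.println("insert rejected: " + e.getMessage());
            }
        } finally {
            client.close();
        }
    }
}
}}}
Against a 2.4 server this prints the first message but the document is invisible to queries that use the index; against 2.6 the catch branch fires instead.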
| Comment by Nic Cottrell (Personal) [ 25/Nov/13 ] |
|
Here's the mongo data dir listing and the /var/log/messages from today. The node went down at about 10am local time (according to MMS), but I think that might have been me trying to restart... |
| Comment by Eliot Horowitz (Inactive) [ 25/Nov/13 ] |
|
Can you send /var/log/messages and also the output of ls -la on the data db directory? |
| Comment by Nic Cottrell (Personal) [ 25/Nov/13 ] |
|
It's hard to say - there's quite a lot of variation in the data, so I'd estimate that about 90% of the calls result in inserts. I seem to have missed some checks (the production version of the site still seems to upsert longer texts) and the primary crashed again overnight. This time the journal dir was empty but the mongod.lock file was still there. Removing the lock and starting mongod gave this error (at least not a negative value):
|
| Comment by Eliot Horowitz (Inactive) [ 25/Nov/13 ] |
|
The last line is not a concern (though it probably means mongod was started as a different user at some point). The errors from the Java driver on upsert - are those likely to be new documents or documents being modified? |
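In case it helps answer this, a minimal sketch (my own illustration, not from the thread) of how the legacy Java driver distinguishes the two cases via WriteResult.isUpdateOfExisting():
{{{
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.WriteResult;

public class UpsertKind {
    // Labels whether an upsert inserted a new document or modified an
    // existing one, based on the WriteResult the driver returns.
    static String classifyUpsert(DBCollection coll, BasicDBObject query, BasicDBObject doc) {
        WriteResult result = coll.update(query, doc, true /* upsert */, false /* multi */);
        return result.isUpdateOfExisting() ? "modified existing" : "inserted new";
    }
}
}}}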
| Comment by Nic Cottrell (Personal) [ 24/Nov/13 ] |
|
It seems that the server crashes unexpectedly and then the data is corrupted... Is this line a concern?
|
| Comment by Nic Cottrell (Personal) [ 24/Nov/13 ] |
|
Primary node on the primary shard crashed again and is down. The secondary node has taken over. If it helps, I get a lot of:
{{{
Sat Nov 23 12:35:39.120 [conn747] jerome5.TranslationQueue ERROR: key too large len:1211 max:1024 1211 jerome5.TranslationQueue.$tl_1_g_1_sl_1_st_1
}}}
These are triggered by an upsert from the Java mongo driver. I've patched the code to skip upserts on this collection when they would generate a key of more than 1024 bytes. Will let you know if the crash happens again... Right now I'll go run a DB repair again. |
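For reference, a minimal sketch of the kind of guard described above, using the legacy MongoDB Java driver. The field names (tl, g, sl, st) are taken from the index name in the log; the size check itself is an assumption about how such a patch might look, not the actual code.
{{{
import java.nio.charset.StandardCharsets;

import com.mongodb.DBCollection;
import com.mongodb.DBObject;

public class SafeUpsert {
    // 2.4's index key size limit, matching the "max:1024" in the log above.
    private static final int MAX_INDEX_KEY_BYTES = 1024;

    // Approximate the index key size by summing the UTF-8 byte lengths of
    // the indexed fields (per the tl_1_g_1_sl_1_st_1 index). The real BSON
    // key encoding adds a few bytes of overhead, so this errs permissive;
    // subtract a safety margin if exactness matters.
    static boolean keyTooLarge(DBObject doc, String... indexedFields) {
        int total = 0;
        for (String field : indexedFields) {
            Object value = doc.get(field);
            if (value instanceof String) {
                total += ((String) value).getBytes(StandardCharsets.UTF_8).length;
            }
        }
        return total > MAX_INDEX_KEY_BYTES;
    }

    // Skip the upsert entirely when the combined key would exceed the limit.
    static void safeUpsert(DBCollection queue, DBObject query, DBObject doc) {
        if (keyTooLarge(doc, "tl", "g", "sl", "st")) {
            return; // would log "key too large" on 2.4, raise an error on 2.6
        }
        queue.update(query, doc, true /* upsert */, false /* multi */);
    }
}
}}}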