[SERVER-11698] Fatal Assertion during replication Created: 14/Nov/13 Updated: 10/Dec/14 Resolved: 19/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Tuomas Silen | Assignee: | Unassigned |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Linux, Ubuntu 12.04 64-bit |
||
| Operating System: | ALL |
| Steps To Reproduce: | unknown |
| Participants: |
| Description |
|
One of the replicas hit fatal assertion 16360 during replication:

Thu Nov 14 07:44:15.896 [repl writer worker 1] production.messages Assertion failure type == cbindata src/mongo/db/key.cpp 585 }
***aborting after fassert() failure
Thu Nov 14 07:44:15.913 Got signal: 6 (Aborted).
Thu Nov 14 07:44:15.915 Backtrace:

After that, mongod no longer starts; it crashes with the same assertion every time.

The document contained an (indexed) array field where one entry was "schöne", so it included a non-ASCII character. It is hard to say whether that is relevant. The other replica handled it just fine.

We've seen these replication errors every few months, previously with 2.2.x and now with 2.4.x.

It takes a week to resync a replica. Is it safe to manually insert that document (or an empty one with the same ts and h) into db.oplog.rs in the local database, so that replication then continues normally? |
| Comments |
| Comment by Tuomas Silen [ 18/Mar/14 ] |
|
Hi Stephen, Yes, sorry for not updating the ticket. We ended up replacing the disks after the issue kept recurring, and we haven't seen it since. So it does seem to have been related to the disks, although SMART, swraid, fsck, etc. didn't find anything and there was nothing in the logs or dmesg. Feel free to close the ticket. |
| Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ] |
|
Hi Tuomas, Apologies for the delay in following up. Are you still seeing this issue, or were you able to resolve it? Thanks, |
| Comment by Tuomas Silen [ 18/Nov/13 ] |
|
There are no disk errors in syslog/dmesg, and SMART doesn't show any errors. The disks are SSDs, about 5 months old. |
| Comment by Eliot Horowitz (Inactive) [ 18/Nov/13 ] |
|
Can you check the system logs for disk errors? |
| Comment by Tuomas Silen [ 15/Nov/13 ] |
|
We inserted that entry into the oplog and replication continued successfully, but a day later the same mongod crashed again on another document (that one also contained ö and ü characters). So I guess the index there is somehow corrupted. |