[SERVER-23250] One node in replicated sharded clusters keeps crashing Created: 21/Mar/16 Updated: 26/Apr/16 Resolved: 26/Apr/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Aise Bouma | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Steps To Reproduce: | Happens randomly |
| Participants: | |
| Description |
|
One node crashes about once a day with the following error:
|
| Comments |
| Comment by Kelsey Schubert [ 26/Apr/16 ] | ||
|
Hi aisebouma, Thank you for letting us know that the issue you originally reported has gone away. I am going to close this ticket to reflect that. Regarding the new issue you are experiencing, from the information you have provided I believe this is a duplicate of Kind regards, | ||
| Comment by Aise Bouma [ 22/Apr/16 ] | ||
|
After changing the hardware, this node is the only one of the 7 servers in the cluster that still crashes now and then (about once a week now). The error is different now: 2016-04-22T14:42:42.709+0200 F REPL [rsSync] replication oplog stream went back in time. previous timestamp: 571a1c42:1bd newest timestamp: 571a1a60:45. Op being applied: { <data removed for security reasons>, module: "DEBUG", message: "Set ID = 114, position(s) = POS_INPUT_CONVEYOR_3" } } Is this a known issue? Do you need some extra info? | ||
| Comment by Aise Bouma [ 01/Apr/16 ] | ||
|
OK, I will do that. After changing the hardware and resyncing, the server has only experienced one crash in the past week, and that crash seems to have a different cause. | ||
| Comment by Kelsey Schubert [ 31/Mar/16 ] | ||
|
Hi aisebouma, We would like to compare the oplogs of the primary and the secondary at the time of the segfault to get a better idea of what is triggering the crash. When you next experience this issue, following the crash, can you please follow the steps below:
Once you have dumped the oplog of the secondary, you may restart it with its standard configuration to have it rejoin the replica set. When you have these two dumps, please upload them to the same portal as before. Thank you again for your help, | ||
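For reference, an oplog dump with mongodump typically looks like the sketch below. The host names, ports, and output paths are placeholders rather than values from this ticket, and the secondary is assumed to have been restarted as a standalone mongod (no --replSet) before being dumped.

```
# Sketch only: dump the oplog from the primary (host/port/paths are placeholders)
mongodump --host primary.example.net --port 27017 \
          --db local --collection oplog.rs \
          --out /backup/oplog-primary

# Sketch only: dump the oplog from the crashed secondary after restarting it
# as a standalone mongod, here assumed to be listening on port 27018
mongodump --host secondary.example.net --port 27018 \
          --db local --collection oplog.rs \
          --out /backup/oplog-secondary
```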
| Comment by Aise Bouma [ 31/Mar/16 ] | ||
|
My config servers contain far less data, so it must be the right node. Please remember that the server only crashed about once a day, so it was probably triggered by some action on the database. | ||
| Comment by Ramon Fernandez Marina [ 30/Mar/16 ] | ||
|
aisebouma, I've downloaded your files and started up a node in maintenance mode. I'm running validate(true) on your data, but so far everything is working well. One question though: the data above shows the crashing node belongs to a replica set named midden, but when I start this node the logs say configReplSet, which is a config server. Are you sure you sent me the data from the right node? EDIT: I've been able to successfully validate the complete dataset you uploaded, so I'm wondering if we're looking at the right node. I'll retrace my steps in case I made an error, but can you please check on your end that you uploaded the dataset for the right node? Thanks, | ||
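For context, the kind of full validation described above can be run against a standalone mongod from the shell. The snippet below is only a sketch (the port is a placeholder, and it walks every collection of every database rather than any particular one):

```
# Sketch only: run validate(true) on every collection of a standalone mongod
# listening on port 27017 (the port is a placeholder)
mongo --port 27017 --quiet --eval '
  db.adminCommand("listDatabases").databases.forEach(function (d) {
    var cur = db.getSiblingDB(d.name);
    cur.getCollectionNames().forEach(function (c) {
      printjson(cur.getCollection(c).validate(true));
    });
  });
'
```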
| Comment by Ramon Fernandez Marina [ 23/Mar/16 ] | ||
|
Thanks for taking the time to upload the data aisebouma, I'm downloading it to my local machine now. We'll post any updates to this ticket. Cheers, | ||
| Comment by Aise Bouma [ 23/Mar/16 ] | ||
|
The upload is complete now. I tried resyncing the node, but that also failed. I will now replace the node's hardware and try to sync it. | ||
| Comment by Aise Bouma [ 22/Mar/16 ] | ||
|
OK I will give the upload a try. | ||
| Comment by Ramon Fernandez Marina [ 22/Mar/16 ] | ||
|
aisebouma, unfortunately I don't think it's possible to pinpoint the interesting data, but I'll ask others. Note that one option is to resync this node from the primary. While that may prevent us from investigating further, it would allow you to bring this node back into operation. If you choose to upload the data you'll need to split it into 5GB chunks; you can use split as shown below:
and upload all the part.* files. | ||
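A typical invocation, assuming the dbpath has first been archived into a file named dbpath.tar.gz (a placeholder name, as are the directory and prefix), would look like:

```
# Sketch only: archive the dbpath and split the archive into 5GB pieces
# named part.aa, part.ab, ... (file and directory names are placeholders)
tar -czf dbpath.tar.gz /data/db
split -b 5G dbpath.tar.gz part.
```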
| Comment by Aise Bouma [ 22/Mar/16 ] | ||
|
The data is 57GB in size. Uploading will take forever. Is any (small) part of it of particular interest? | ||
| Comment by Ramon Fernandez Marina [ 21/Mar/16 ] | ||
|
aisebouma, it looks like there could be a race condition somewhere, as it's very strange that only one node is affected. The alternative is that your data got into a state that causes this segfault to trigger, so could you upload the dbpath contents for this node here? This is a private, secure upload portal where your files will only be accessible to MongoDB staff for the purpose of tracking down this bug. Thanks, | ||
| Comment by Ramon Fernandez Marina [ 21/Mar/16 ] | ||
|
Thanks for your report and for uploading the log, aisebouma; we're investigating. |