[SERVER-24552] segmentation fault during initial sync under heavy load Created: 14/Jun/16 Updated: 25/Jul/16 Resolved: 25/Jul/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Internal Code |
| Affects Version/s: | 3.2.7 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alan Jackson | Assignee: | Kelsey Schubert |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Initial sync a new member on a heavily loaded, large cluster. |
| Participants: |
| Description |
|
The cluster was under heavy load (during load testing). Syncing a new cluster member failed with a segmentation fault after about 30 minutes of syncing. |
| Comments |
| Comment by Kelsey Schubert [ 25/Jul/16 ] |
|
Hi ajax@tvsquared.com and adq@tvsquared.com, Thank you for providing the outputs of the commands I listed. Regrettably, the .so file is too highly optimized to provide useful debugging information. We believe this segmentation fault is likely related to resource exhaustion. The workload recorded in the logs appears to be creating a large number (7000, on average) of very short-lived incoming connections to the server. Reducing the number of connections opened per second may resolve this issue. Unfortunately, we have not been able to reproduce this issue on our side. If you are able to provide a sample workload which reliably reproduces this issue, please let us know and we will continue to investigate. Kind regards, |
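One way to act on the suggestion to open fewer connections per second is to reuse a small, bounded set of long-lived connections on the client side. The following is a minimal Python sketch of that idea, not code from this ticket: `ConnectionPool` and the `factory` callable are hypothetical names. The official MongoDB drivers generally already pool connections when a single client object is shared, so in practice the change may be as simple as not constructing a new client per operation.

```python
import queue
import threading
from contextlib import contextmanager


class ConnectionPool:
    """Reuse a bounded set of long-lived connections instead of opening one per request."""

    def __init__(self, factory, max_size):
        self._factory = factory        # hypothetical callable that opens one connection
        self._max_size = max_size      # upper bound on connections ever opened
        self._idle = queue.LifoQueue() # connections waiting to be reused
        self._created = 0
        self._lock = threading.Lock()

    def _acquire(self):
        # Prefer an idle connection if one is available.
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            pass
        # Otherwise open a new one, but only while under the cap.
        with self._lock:
            if self._created < self._max_size:
                conn = self._factory()
                self._created += 1
                return conn
        # At the cap: block until another caller returns a connection.
        return self._idle.get()

    @contextmanager
    def connection(self):
        conn = self._acquire()
        try:
            yield conn
        finally:
            # Return the connection to the pool rather than closing it.
            self._idle.put(conn)
```

With this pattern, thousands of sequential operations wrapped in `with pool.connection() as conn:` share one underlying connection, and concurrent callers block once `max_size` connections are in use instead of opening more.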
| Comment by Andrew de Quincey [ 18/Jun/16 ] |
|
Hi, I'm Alan's colleague:
|
| Comment by Kelsey Schubert [ 17/Jun/16 ] |
|
Thank you for providing the diagnostic.data. Would you please clarify how much swap memory you have configured? The stack trace you have provided includes these two lines:
These lines can be parsed by following the steps below: 1. Please post the output of the following commands:
These commands should identify the version and location of the libpthread.so.0 library. 2. Please execute the following commands in your shell and post the output:
This set of instructions should provide the line numbers in libc. Thank you for your help, |
| Comment by Alan Jackson [ 17/Jun/16 ] |
|
Updated with corresponding diagnostic data. |
| Comment by Alan Jackson [ 17/Jun/16 ] |
|
We installed the debug symbols, but they do not appear to have shown up in the log file. I've attached a more recent crash with the corresponding diagnostic.data. |
| Comment by Alan Jackson [ 16/Jun/16 ] |
|
Hi Andrew. |
| Comment by Andrew Morrow (Inactive) [ 15/Jun/16 ] |
|
Hi ajax@tvsquared.com - We have done some investigation into this issue, but so far we have not been able to derive a root cause. The crash is within pthread_create, which seems odd. We have looked at the arguments we are passing to pthread_create at this point in the call stack, and all of them refer to stack variables or free functions. So we don't see a way that we would be passing invalid data into pthread_create. Given the large amount of data in flight, we suspect some sort of resource exhaustion. However, if that were the case, we would expect to see pthread_create return EAGAIN, not crash. Unfortunately, while we can symbolize the provided stack trace into our code, we don't have the same access to the local copy of libpthread that you are using. If there is any way you could try to symbolize that portion of the stack trace and get us a line number in libc, along with the exact libc version, that would go a long way towards trying to understand why this call to pthread_create leads to a crash. If you are interested in working with us to do that, please let us know and we can work with you to develop a procedure to do so. Finally, you stated that you are able to consistently reproduce the issue. If possible, could you please attach a tar of the diagnostic.data directory in your dbpath from the crashing node, immediately after the crash? It would also be useful if you could keep an eye on the memory usage of the server on the crashing node, leading up to the crash. |
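To keep an eye on the server's memory usage leading up to the crash, one option on Linux is to sample `/proc/<pid>/status` at a fixed interval. This is a minimal sketch, assuming a Linux host (the libpthread.so.0 frames suggest one) and that the mongod PID is known; `parse_proc_status`, `sample`, and `watch` are hypothetical helper names, not part of any MongoDB tooling. The thread count is logged alongside resident and swapped memory because the crash occurs during thread creation.

```python
import re
import time

# Fields of interest in /proc/<pid>/status: resident memory, swapped memory, thread count.
_FIELD = re.compile(r"^(VmRSS|VmSwap|Threads):\s+(\d+)", re.MULTILINE)


def parse_proc_status(text):
    """Extract VmRSS and VmSwap (in kB) and Threads from the text of /proc/<pid>/status."""
    return {name: int(value) for name, value in _FIELD.findall(text)}


def sample(pid):
    """Read one snapshot for the given process ID (Linux only)."""
    with open(f"/proc/{pid}/status") as fh:
        return parse_proc_status(fh.read())


def watch(pid, interval_s=5.0, out=print):
    """Emit one timestamped snapshot per interval until the process exits."""
    while True:
        try:
            stats = sample(pid)
        except (FileNotFoundError, ProcessLookupError):
            out(f"process {pid} is gone")
            return
        out(f"{time.strftime('%H:%M:%S')} {stats}")
        time.sleep(interval_s)
```

Running `watch(<mongod pid>)` in a separate terminal during the initial sync and saving its output would give a timeline of memory and thread growth to attach next to the diagnostic.data archive.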
| Comment by Alan Jackson [ 14/Jun/16 ] |
|
This server was a third node: 8 cores, 64 GB RAM. The issue recurs repeatedly whilst load on the primary is kept high. Why does it not try to sync from the secondary? |