[SERVER-24552] segmentation fault during initial sync under heavy load Created: 14/Jun/16  Updated: 25/Jul/16  Resolved: 25/Jul/16

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 3.2.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alan Jackson Assignee: Kelsey Schubert
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongosegfault.log     File mongosegfault.tar    
Operating System: ALL
Steps To Reproduce:

Initial sync a new member on a heavily loaded, large cluster.

 Description   

Cluster under heavy load (during load testing.)
~1M operations/minute, ~500k inserts
3.2.7
Primary wiredtiger, 16 core, 64GB ram
Secondary mmap 8 core, 64GB ram
800GB of data, in ~600 databases

Syncing a new cluster member failed, with a segmentation fault after about 30 minutes of syncing.



 Comments   
Comment by Kelsey Schubert [ 25/Jul/16 ]

Hi ajax@tvsquared.com and adq@tvsquared.com,

Thank you for providing the outputs of the commands I listed. Regrettably, the .so file is too highly optimized to provide useful debugging information.

We believe this segmentation fault is likely related to resource exhaustion. The workload recorded in the logs appears to create a large number (7000, on average) of very short-lived incoming connections to the server. Reducing the number of connections opened per second may resolve this issue.
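
As a rough illustration of what reducing connection churn can look like on the client side, here is a minimal sketch of a fixed-size pool that reuses a few long-lived connections instead of opening one per operation. `ConnectionPool` and `fake_connect` are hypothetical names for this example only and are not part of any MongoDB driver. Most drivers already ship their own pooling, so in practice the change may simply be to create one client per process and reuse it.

```python
import queue
from contextlib import contextmanager


class ConnectionPool:
    """Hands out a bounded set of long-lived connections (illustrative only)."""

    def __init__(self, connect, size):
        # Open `size` connections once, up front; they are reused afterwards.
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(connect())

    @contextmanager
    def connection(self, timeout=None):
        # Blocks while every pooled connection is in use, which caps the
        # number of simultaneous connections the server has to service.
        conn = self._idle.get(timeout=timeout)
        try:
            yield conn
        finally:
            self._idle.put(conn)


# Hypothetical stand-in for a real "open a connection" call; it only
# counts how many times a connection was opened.
opened = []


def fake_connect():
    opened.append(object())
    return opened[-1]


pool = ConnectionPool(fake_connect, size=4)
for _ in range(1000):
    with pool.connection() as conn:
        pass  # a real workload would issue its operation on `conn` here

print("operations: 1000, connections opened:", len(opened))
```

The point of the sketch is only that the number of connection opens is bounded by the pool size rather than by the operation count.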

Unfortunately, we have not been able to reproduce this issue on our side. If you are able to provide a sample workload which reliably reproduces this issue, please let us know and we will continue to investigate.

Kind regards,
Thomas

Comment by Andrew de Quincey [ 18/Jun/16 ]

Hi, I'm Alan's colleague:

cat /proc/swaps:
Filename                                Type            Size    Used    Priority
/data/swapfile                          file            2097148 22344   -1
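
For readers who want those figures in friendlier units, a small sketch that parses the output above. It assumes the Size and Used columns are in KiB, which is how Linux reports them in /proc/swaps; `parse_swaps` is a name made up for this example.

```python
# The /proc/swaps output quoted above, copied verbatim.
SWAPS = """\
Filename                                Type            Size    Used    Priority
/data/swapfile                          file            2097148 22344   -1
"""


def parse_swaps(text):
    """Parse /proc/swaps-style text into a list of dicts (sizes in KiB)."""
    rows = []
    for line in text.splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) != 5:
            continue  # ignore blank or malformed lines
        name, kind, size, used, priority = parts
        rows.append({
            "name": name,
            "type": kind,
            "size_kib": int(size),
            "used_kib": int(used),
            "priority": int(priority),
        })
    return rows


rows = parse_swaps(SWAPS)
for row in rows:
    size_gib = row["size_kib"] / 1024 ** 2
    used_mib = row["used_kib"] / 1024
    used_pct = 100.0 * row["used_kib"] / row["size_kib"]
    print(f"{row['name']}: {size_gib:.2f} GiB total, "
          f"{used_mib:.2f} MiB used ({used_pct:.2f}%)")
```

In other words, roughly 2 GiB of swap is configured and only about 1% of it was in use when the output was captured.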

ldd -v /usr/bin/mongod
        linux-vdso.so.1 =>  (0x00007ffd4e07f000)
        libssl.so.1.0.0 => /lib/x86_64-linux-gnu/libssl.so.1.0.0 (0x00007fcdda553000)
        libcrypto.so.1.0.0 => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0 (0x00007fcdda10f000)
        librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fcdd9f06000)
        libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fcdd9d02000)
        libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fcdd99fa000)
        libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fcdd97e2000)
        libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fcdd95c4000)
        libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fcdd91fa000)
        /lib64/ld-linux-x86-64.so.2 (0x0000562016f8d000)
 
        Version information:
        /usr/bin/mongod:
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.10) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.6) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.8) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.9) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.7) => /lib/x86_64-linux-gnu/libc.so.6
                libgcc_s.so.1 (GCC_3.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_4.2.0) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.3) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libgcc_s.so.1 (GCC_3.4) => /lib/x86_64-linux-gnu/libgcc_s.so.1
                libpthread.so.0 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libpthread.so.0
                libpthread.so.0 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libpthread.so.0
                libcrypto.so.1.0.0 (OPENSSL_1.0.0) => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                libm.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libm.so.6
                libdl.so.2 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libdl.so.2
                libssl.so.1.0.0 (OPENSSL_1.0.0) => /lib/x86_64-linux-gnu/libssl.so.1.0.0
                libssl.so.1.0.0 (OPENSSL_1.0.1) => /lib/x86_64-linux-gnu/libssl.so.1.0.0
                librt.so.1 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/librt.so.1
        /lib/x86_64-linux-gnu/libssl.so.1.0.0:
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libcrypto.so.1.0.0 (OPENSSL_1.0.1d) => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
                libcrypto.so.1.0.0 (OPENSSL_1.0.1) => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
                libcrypto.so.1.0.0 (OPENSSL_1.0.2) => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
                libcrypto.so.1.0.0 (OPENSSL_1.0.0) => /lib/x86_64-linux-gnu/libcrypto.so.1.0.0
        /lib/x86_64-linux-gnu/libcrypto.so.1.0.0:
                libdl.so.2 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libdl.so.2
                libc.so.6 (GLIBC_2.3) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.7) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.4) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.4) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/librt.so.1:
                libpthread.so.0 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libpthread.so.0
                libpthread.so.0 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libpthread.so.0
                libpthread.so.0 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libpthread.so.0
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libdl.so.2:
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libm.so.6:
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libgcc_s.so.1:
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libpthread.so.0:
                ld-linux-x86-64.so.2 (GLIBC_2.2.5) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2
                libc.so.6 (GLIBC_2.14) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.3.2) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_PRIVATE) => /lib/x86_64-linux-gnu/libc.so.6
                libc.so.6 (GLIBC_2.2.5) => /lib/x86_64-linux-gnu/libc.so.6
        /lib/x86_64-linux-gnu/libc.so.6:
                ld-linux-x86-64.so.2 (GLIBC_2.3) => /lib64/ld-linux-x86-64.so.2
                ld-linux-x86-64.so.2 (GLIBC_PRIVATE) => /lib64/ld-linux-x86-64.so.2

dpkg -s libc6
Package: libc6
Status: install ok installed
Priority: required
Section: libs
Installed-Size: 10767
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Multi-Arch: same
Source: glibc
Version: 2.21-0ubuntu4.3
Replaces: libc6-amd64
Depends: libgcc1
Suggests: glibc-doc, debconf | debconf-2.0, locales
Breaks: hurd (<< 1:0.5.git20140203-1), libtirpc1 (<< 0.2.3), lsb-core (<= 3.2-27), nscd (<< 2.21)
Conflicts: prelink (<= 0.0.20090311-1), tzdata (<< 2007k-1), tzdata-etch
Conffiles:
 /etc/ld.so.conf.d/x86_64-linux-gnu.conf 593ad12389ab2b6f952e7ede67b8fbbf
Description: GNU C Library: Shared libraries
 Contains the standard libraries that are used by nearly all programs on
 the system. This package includes shared versions of the standard C library
 and the standard math library, as well as many others.
Homepage: http://www.gnu.org/software/libc/libc.html
Original-Maintainer: GNU Libc Maintainers <debian-glibc@lists.debian.org>

gdb ./lib/x86_64-linux-gnu/libpthread-2.21.so
Reading symbols from ./lib/x86_64-linux-gnu/libpthread-2.21.so...Reading symbols from /usr/lib/debug//lib/x86_64-linux-gnu/libpthread-2.21.so...done.
done.
(gdb) info line *(pthread_create+0x93E0)
No line number information available for address 0x10d10 <__restore_rt>
(gdb) info line *(pthread_create+0x4FB)
Line 711 of "pthread_create.c" starts at address 0x7e2b <__pthread_create_2_1+1275> and ends at 0x7e35 <__pthread_create_2_1+1285>.
(gdb) quit

Comment by Kelsey Schubert [ 17/Jun/16 ]

Hi ajax@tvsquared.com,

Thank you for providing the diagnostic.data. Would you please clarify how much swap memory you have configured?

The stack trace you have provided includes these two lines:

Jun 14 13:28:00 collectordb2 mongod[17204]: libpthread.so.0(+0x10D10) [0x7f1a73c04d10]
Jun 14 13:28:00 collectordb2 mongod[17204]: libpthread.so.0(pthread_create+0x4FB) [0x7f1a73bfbe2b]

These lines can be parsed by following the steps below:

1. Please post the output of the following commands:

ldd -v /path/to/mongod
dpkg -s libc6

These commands should identify the version and location of the libpthread.so.0 library.

2. Please execute the following commands in your shell and post the output:

gdb /path/to/libpthread.so.0
info line *(pthread_create+0x93E0)
info line *(pthread_create+0x4FB)
quit

This set of instructions should provide the line numbers in libc.
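
The relationship between the two stack frames and the two gdb expressions can be checked with a little arithmetic. The sketch below (Python, with illustrative names such as `parse_frame`) takes the two frames quoted above, derives the library's load address from the symbol-less frame, and shows that the raw offset +0x10D10 corresponds to pthread_create+0x93E0. It assumes both frames come from the same mapping of libpthread.so.0, and that gdb's pthread_create is the same symbol the backtrace resolved.

```python
import re

# The two libpthread frames quoted from the syslog above.
FRAMES = [
    "libpthread.so.0(+0x10D10) [0x7f1a73c04d10]",
    "libpthread.so.0(pthread_create+0x4FB) [0x7f1a73bfbe2b]",
]

# module(symbol+0xOFFSET) [0xADDRESS]; the symbol may be empty.
FRAME_RE = re.compile(
    r"(?P<module>[^\s(]+)\((?P<symbol>\w*)\+0x(?P<offset>[0-9A-Fa-f]+)\)"
    r"\s+\[0x(?P<addr>[0-9A-Fa-f]+)\]"
)


def parse_frame(line):
    """Return (module, symbol, offset, absolute_address) for one frame."""
    m = FRAME_RE.search(line)
    if m is None:
        raise ValueError(f"unrecognised frame: {line!r}")
    return (m.group("module"), m.group("symbol"),
            int(m.group("offset"), 16), int(m.group("addr"), 16))


# Frame without a symbol: the offset is relative to the library's load address.
_, _, raw_offset, raw_addr = parse_frame(FRAMES[0])
load_base = raw_addr - raw_offset

# Frame with a symbol: the offset is relative to the start of pthread_create.
_, symbol, sym_offset, sym_addr = parse_frame(FRAMES[1])
module_offset = sym_addr - load_base        # position inside the library file
symbol_start = module_offset - sym_offset   # where pthread_create begins

# Express the symbol-less frame relative to pthread_create for gdb.
relative = raw_offset - symbol_start

print(f"load base            : {load_base:#x}")
print(f"{symbol} starts at: {symbol_start:#x}")
print(f"gdb expression       : info line *({symbol}+{relative:#X})")
```

The module-relative offset computed for the second frame is 0x7e2b, which matches the address gdb reports for pthread_create+0x4FB in the 18/Jun/16 comment.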

Thank you for your help,
Thomas

Comment by Alan Jackson [ 17/Jun/16 ]

Updated with corresponding diagnostic data.

Comment by Alan Jackson [ 17/Jun/16 ]

We installed the debug symbols, but they do not appear to have been picked up in the log file.
I'm not sure how we would go about symbolising that. We still have the environment up in which the error reproduces.

I've attached a more recent crash with the corresponding diagnostic.data

Comment by Alan Jackson [ 16/Jun/16 ]

Hi Andrew.
I've replicated the event again, and have installed the symbol files etc.
It's close of business here today, but I'll be able to send you a fresh syslog, plus diagnostic data etc. tomorrow.

Comment by Andrew Morrow (Inactive) [ 15/Jun/16 ]

Hi ajax@tvsquared.com - We have done some investigation into this issue, but so far we have not been able to derive a root cause. The crash is within pthread_create, which seems odd. We have looked at the arguments we are passing to pthread_create at this point in the call stack, and all of them refer to stack variables or free functions. So we don't see a way that we would be passing invalid data into pthread_create. Given the large amount of data in flight, we suspect some sort of resource exhaustion. However, if that were the case, we would expect to see pthread_create return EAGAIN, not crash.

Unfortunately, while we can symbolize the provided stack trace into our code, we don't have the same access to the local copy of libpthread that you are using. If there is any way you could try to symbolize that portion of the stack trace and get us a line number in libc, along with the exact libc version, that would go a long way towards trying to understand why this call to pthread_create leads to a crash. If you are interested in working with us to do that, please let us know and we can work with you to develop a procedure to do so.

Finally, you stated that you are able to consistently reproduce the issue. If possible, could you please attach a tar of the diagnostic.data directory in your dbpath from the crashing node, immediately after the crash? It would also be useful if you could keep an eye on the memory usage of the server on the crashing node, leading up to the crash.
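
One lightweight way to keep an eye on memory in the run-up to a crash is to sample /proc/meminfo at a fixed interval and log a few fields. The sketch below is only an illustration, not an official procedure: the function names, the chosen fields, and the 10-second interval are assumptions, and it relies on a Linux kernel recent enough to report MemAvailable.

```python
import time


def parse_meminfo(text):
    """Return {field: value} from /proc/meminfo-style text.

    Most values are in KiB; a few fields (e.g. HugePages_Total) are counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def sample(path="/proc/meminfo"):
    """Read the file once and keep only the fields of interest."""
    with open(path) as fh:
        info = parse_meminfo(fh.read())
    keys = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {key: info.get(key) for key in keys}


def watch(interval_s=10, count=3):
    """Print a timestamped sample every `interval_s` seconds, `count` times."""
    for _ in range(count):
        print(time.strftime("%Y-%m-%dT%H:%M:%S"), sample())
        time.sleep(interval_s)


if __name__ == "__main__":
    watch()
```

Redirecting the output to a file on the syncing node would leave a timeline that can be lined up against the crash timestamp in syslog.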

Comment by Alan Jackson [ 14/Jun/16 ]

This server was a 3rd node. 8 core, 64GB ram.
All are Amazon AWS ubuntu 15.10

This issue recurs repeatedly whilst load on the primary is kept high.
Typically within ~5 minutes.

Why does it not try to sync from the secondary?
Why does it crash rather than retrying?

Generated at Thu Feb 08 04:06:41 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.