|
Thank you george.wangensteen@mongodb.com and MongoDB team! We noticed the same thing in the logs that you identified, and we appreciate all your help in figuring out what is going on. It's still perplexing to us why an idle bash session would mess around with this mongodb oplog collection. The only theory we have is that it has something to do with the combination of mosh server and tmux. I'm just putting the note here in case anyone else hits this issue and has those 2 procs running as well.
|
|
Hi amit.gupta@opensense.com ,
Thanks for the additional information. We were able to use it to determine that an outside bash process is indeed truncating the relevant WiredTiger file while mongod is running, causing the crash. Here's a quick summary of what we've found:
140343: Checkpointer 1900 [003] 495405.040696: xfs:xfs_itruncate_extents_start: dev 9:127 ino 0x600d8bdf size 0x190000 new_size 0x190000
|
141194: bash 48011 [015] 498216.220457: xfs:xfs_itruncate_extents_start: dev 9:127 ino 0x600d8bdf size 0x0 new_size 0x0
|
141205: ReplCoo.xtern-3 2031 [013] 498253.231804: xfs:xfs_itruncate_extents_start: dev 9:127 ino 0x600d8bdf size 0x18a000 new_size 0x18a000
|
- As you can see, at time 498216.220457, a bash process with PID 48011 truncates the file to size 0. The differing PID shows this is not the mongod process whose Checkpointer thread appears directly above.
- Using the boot-time you provided, we can convert this timestamp to epoch time 1684948296 seconds, which is 1:11:36 on May 24th, 25 seconds before the crash. Because the checkpointer thread runs every minute and is the first mongod thread to read the file after the truncate, it sees the truncated file and crashes.
- We also see the process with PID 48011 in the ps output you provided, confirming it is not mongod. Unfortunately, ps only tells us this process is bash inside a tmux session; to determine what command-line invocation of bash this particular process was started with, you'd need to read /proc/48011/cmdline, assuming the same shell/tmux session is still running (a quick sketch of this is below).
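For example, something along these lines would show the invocation and the parent chain, while the process is still alive:
# /proc/<pid>/cmdline is NUL-separated; translate the NULs to spaces for readability
tr '\0' ' ' < /proc/48011/cmdline; echo
# The parent chain (e.g. tmux or mosh-server) can also help identify the session
ps -o pid,ppid,comm -p 48011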
I'm glad we've been able to get to the bottom of this issue. Since we've determined that the issue is outside of MongoDB/involves another process unsafely truncating the MongoDB datafiles, I don't think a SERVER ticket is needed any more. I'm going to go ahead and close this issue. If you need additional help determining the identity of the offending process, I'm sure our consulting teams could help.
|
|
no prob.. after a bunch of jiggering, i think it's uploaded as ashdb2-rocky8-mdb606-audit.tar.gz (13MB)
The server boot_time = 1684450080 (for use with the timestamps in logs)
it contains (amongst other things)
- collection-18-5971048155957614497.wt
- mongod.log
- audit.log
- ls -i output
- perf-script.log
Upload return code 201:
< HTTP/2 201
|
< server: nginx
|
< date: Thu, 25 May 2023 21:30:23 GMT
|
< content-type: application/json
|
< content-length: 1190
|
< cache-control: no-cache, no-store
|
< x-envoy-upstream-service-time: 690
|
< strict-transport-security: max-age=31536000
|
< x-box-original-ingress-adc-host: prod-c-traffic-manager-mrz1
|
< via: 1.1 google
|
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
|
<
|
{ [5 bytes data]
|
* TLSv1.3 (IN), TLS app data, [no content] (0):
|
{ [1 bytes data]
|
* TLSv1.3 (IN), TLS app data, [no content] (0):
|
{ [1 bytes data]
|
100 12.9M 100 1190 100 12.9M 307 3434k 0:00:03 0:00:03 --:--:-- 3433k
|
* Connection #0 to host upload.box.com left intact
|
|
|
Hi amit.gupta@opensense.com ,
I suspect this is likely due to an upstream error caused by the large size of the file. I will follow up internally, but in the meanwhile, if you want to upload just the mongod.log and auditd.log files (ideally compressed), we might be able to look at those. It would help if you could confirm the file name, file size, and curl output after attempting to upload the files. Thanks.
|
|
Sure, I just tried running it with verbose and here is the output.. Apologies, I didn't see an error before, but it looks like after some time there is an HTTP/2 500 error:
} [1 bytes data]
|
23 5171M 0 0 23 1234M 0 7470k 0:11:48 0:02:49 0:08:59 0* TLSv1.3 (IN), TLS app data, [no content] (0):
|
< HTTP/2 500
|
< server: nginx
|
< date: Thu, 25 May 2023 20:20:37 GMT
|
< content-type: text/html; charset=UTF-8
|
< content-length: 0
|
< x-box-original-ingress-adc-host: prod-b-traffic-manager-33bb
|
< strict-transport-security: max-age=31536000
|
< via: 1.1 google
|
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000
|
* HTTP error before end of send, stop sending
|
<
|
{ [0 bytes data]
|
23 5171M 0 0 23 1234M 0 7469k 0:11:49 0:02:49 0:09:00 0
|
* Connection #0 to host upload.box.com left intact
|
|
|
Hi amit.gupta@opensense.com - unfortunately I still can't find the data present in the portal. I've checked with our team and we don't see anything interesting in the upload service that might be preventing this. Could you please send the verbose curl output (i.e. running the command with -vvv) and confirm that curl is finishing with a 200 status code?
|
|
It looks complete now without any errors.
|
|
Strange.. I've just clicked the link in your most recent message, refreshed the page/snippet, and am uploading now. It should complete in about 10-15 minutes.. the upload speed seems to get throttled.
|
|
Hi amit.gupta@opensense.com , unfortunately I still don't see your upload in the portal. I tested the link I sent in my previous comment by uploading a file myself and it worked, so I'm not sure what the issue is. Did the cURL command to upload the files complete, or is it still running? Just to confirm, the link I used is https://amphora.corp.mongodb.com/public/upload/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmb2xkZXJfaWQiOiIyMDgyNDk2NDY3ODYiLCJleHAiOjE2ODc2MTU1ODJ9.FlppVO-0BhJF4xFdx83oXPLa9byTBqnzufPWaYFWpCc
Can you confirm that is the link you are using to generate the curl snippet? Please note that each time you refresh the link, a new curl snippet will be generated, so it may help to refresh the page if you have it bookmarked. Let me know if that works. Thanks.
If there are still issues, it may help to run the provided cURL command with the "-vvv" option set to increase verbosity, and to save the output instead of piping it to /dev/null.
|
|
no problem.. just uploaded to your latest link
|
|
Hi amit.gupta@opensense.com ,
Unfortunately, I still don't see any data in the upload portal on my end. I've generated a new upload link for you so that we can try again to transfer the data.
|
|
it looks like it's a different link. I'm now uploading to https://amphora.corp.mongodb.com/public/upload/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmb2xkZXJfaWQiOiIyMDgyNDk2NDY3ODYiLCJleHAiOjE2ODcwMzE5MzB9._2i0EXJOPXs6Qg1o5r7D5ftTGfv-jgc2CCxJLm0zcrQ
|
|
I used this link (which I bookmarked from last time): https://amphora.corp.mongodb.com/public/upload/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmb2xkZXJfaWQiOiIyMDYyMjkzMDk4MTgiLCJleHAiOjE2ODcwMjk3MDF9.z6PbRb8vaH-PwbLLt646SxYNctwo6_yBJSJQjdVlISw
filename: ashdb2-rocky-6.0.6-crash-with-perf-audit.tar
Is it not correct?
|
|
Hi amit.gupta@opensense.com , I don't see the data in the most recent upload portal we exchanged data over. Is that the upload portal you used? Could you let me know the name of the uploaded file? Just for clarity: you do need to get a new cURL snippet from the upload link every time you upload (the cURL snippet changes when the page is refreshed).
|
|
Thanks, the upload is complete. It also includes the auditd.log, the perf script output (perf-script.log), and the "ls -i" output so you can correlate the inode IDs back to filenames. The inode ID seems to be decimal in some files and hex in others. Also, the timestamps seem to be the number of seconds since boot, so to get the actual timestamp we can use something like this:
# Assuming event_time_after_boot is in seconds
event_time_after_boot=50000

# Get the system boot time in seconds since 1970-01-01 00:00:00 UTC
boot_time=$(date -d "$(who -b | awk '{print $3 " " $4}')" +%s)

# Calculate event time
event_time=$((boot_time + event_time_after_boot))

# Display event time in a human-readable format
date -d @$event_time
|
On our server, boot_time = 1684450080
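A more precise alternative (who -b only reports the boot time to the minute) is to read btime directly from /proc/stat:
# btime is the boot time in whole seconds since the epoch
boot_time=$(awk '/^btime/ {print $2}' /proc/stat)
date -d @$((boot_time + event_time_after_boot))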
|
|
Hi amit.gupta@opensense.com ,
Thanks! Let me know when the upload completes. Also, please include the auditd logs / any truncate or open/write events on the relevant file.
|
|
Hi george.wangensteen@mongodb.com We just got another crash on ashdb2 (the server we are monitoring closely). I'm going to gather up all the logs, coredumps and the perf stuff. I was wondering if maybe you wanted to hop on a screen share to poke around the server with me? This is the Rocky Linux 8 box with mdb6.0.6 that was built manually.
|
|
hi george.wangensteen@mongodb.com I think I got your note about the new upload link late. I'm uploading the files for the 3 crashes that happened today on ashdb2, rocky linux8, mongodb 6.0.6. Thanks!
|
|
Will do.. Hopefully the "perf record -e 'xfs:xfs_itruncate_extents_start'" results will give us the inode of the file, which we can correlate back to collection-18.. I will enable verbose debug logging next in /etc/mongod.conf to see if there are any clues.
|
|
Hi amit.gupta@opensense.com ,
We've done some analysis of the first crash you sent tonight. We found that:
- The file WiredTiger panics on is again the data file for the oplog
- The data file again starts with a newline-byte
- The data file has a total of 4364779520 bytes; after the newline byte, the next 4363800584 bytes are null/zero bytes. The file appears to contain valid Oplog data after these null bytes.
We think these findings are highly consistent with our theory that some process is truncating the Oplog datafile while it is in use by MongoDB - we see that almost the entirety of the file appears truncated (i.e. it begins with a newline byte and is zero-filled), while the very end contains data that was likely written by MongoDB at a specific offset after the truncation occurred.
I'm glad to hear you have auditd set up to watch the files. Hopefully we will get another reproduction with auditing enabled, so we can find out whether MongoDB or another process is truncating the file around the time of the issue. I would in particular search for and try to identify truncate and/or open syscalls affecting the file "collection-18-XXXXXXXXX.wt".
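As a sketch, something like this against the audit logs (using the "folder-access" key from your rule) might surface the offending events:
# Interpret the audit records, then look for truncate/open activity on the oplog datafile
ausearch -k folder-access -i | grep -E 'truncate|collection-18'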
|
|
I also have the following command running in a tmux session to try and capture truncate activity.
perf record -e 'xfs:xfs_itruncate_extents_start'
|
|
|
I've got auditd monitoring access to all files in /var/lib/mongo. Below are some example entries. We will see if anything else is touching this directory or files inside of it at the time of the next crash.
type=SYSCALL msg=audit(1684450927.004:106): arch=c000003e syscall=257 success=yes exit=148 a0=ffffff9c a1=56251188d980 a2=241 a3=1b6 items=2 ppid=1 pid=1882 auid=4294967295 uid=989 gid=985 euid=989 suid=989 fsuid=989 egid=985 sgid=985 fsgid=985 tty=(none) ses=4294967295 comm="ftdc" exe="/usr/bin/mongod" key="folder-access"ARCH=x86_64 SYSCALL=openat AUID="unset" UID="mongod" GID="mongod" EUID="mongod" SUID="mongod" FSUID="mongod" EGID="mongod" SGID="mongod" FSGID="mongod"
|
type=CWD msg=audit(1684450927.004:106): cwd="/"
|
type=PATH msg=audit(1684450927.004:106): item=0 name="/var/lib/mongo/diagnostic.data/" inode=1611500507 dev=09:7f mode=040700 ouid=989 ogid=985 rdev=00:00 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PATH msg=audit(1684450927.004:106): item=1 name="/var/lib/mongo/diagnostic.data/metrics.interim.temp" inode=1611500512 dev=09:7f mode=0100600 ouid=989 ogid=985 rdev=00:00 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PROCTITLE msg=audit(1684450927.004:106): proctitle=2F7573722F62696E2F6D6F6E676F64002D66002F6574632F6D6F6E676F642E636F6E66
|
type=SYSCALL msg=audit(1684450927.004:107): arch=c000003e syscall=82 success=yes exit=0 a0=56251188d980 a1=562511884580 a2=7f34da16f570 a3=562510000 items=5 ppid=1 pid=1882 auid=4294967295 uid=989 gid=985 euid=989 suid=989 fsuid=989 egid=985 sgid=985 fsgid=985 tty=(none) ses=4294967295 comm="ftdc" exe="/usr/bin/mongod" key="folder-access"ARCH=x86_64 SYSCALL=rename AUID="unset" UID="mongod" GID="mongod" EUID="mongod" SUID="mongod" FSUID="mongod" EGID="mongod" SGID="mongod" FSGID="mongod"
|
type=CWD msg=audit(1684450927.004:107): cwd="/"
|
type=PATH msg=audit(1684450927.004:107): item=0 name="/var/lib/mongo/diagnostic.data/" inode=1611500507 dev=09:7f mode=040700 ouid=989 ogid=985 rdev=00:00 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PATH msg=audit(1684450927.004:107): item=1 name="/var/lib/mongo/diagnostic.data/" inode=1611500507 dev=09:7f mode=040700 ouid=989 ogid=985 rdev=00:00 nametype=PARENT cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PATH msg=audit(1684450927.004:107): item=2 name="/var/lib/mongo/diagnostic.data/metrics.interim.temp" inode=1611500512 dev=09:7f mode=0100600 ouid=989 ogid=985 rdev=00:00 nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PATH msg=audit(1684450927.004:107): item=3 name="/var/lib/mongo/diagnostic.data/metrics.interim" inode=1611500514 dev=09:7f mode=0100600 ouid=989 ogid=985 rdev=00:00 nametype=DELETE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PATH msg=audit(1684450927.004:107): item=4 name="/var/lib/mongo/diagnostic.data/metrics.interim" inode=1611500512 dev=09:7f mode=0100600 ouid=989 ogid=985 rdev=00:00 nametype=CREATE cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0 cap_frootid=0OUID="mongod" OGID="mongod"
|
type=PROCTITLE msg=audit(1684450927.004:107): proctitle=2F7573722F62696E2F6D6F6E676F64002D66002F6574632F6D6F6E676F642E636F6E66
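For reference, a watch rule that produces entries like these (note the key="folder-access" field) looks something like:
# Audit reads/writes/attribute changes on everything under the MongoDB data directory
auditctl -w /var/lib/mongo -p rwa -k folder-access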
|
|
|
Interesting. I've just got a 3rd crash today with ashdb2. This time zabbix-agent was disabled, and everything else on the server comes from a minimal rocky8 install. This error looks like the 0-checksum error with the many 00 bytes. I am uploading the file named ashdb2-third-crash-rockylinux8-mongod6.0.6.tar (though the transfer is very slow for some reason). It contains both the core dump and /var/lib/mongo as well.
I'm going to look into how to get auditd monitoring the /var/lib/mongo directory and keep you posted.
|
|
As background, I've reset ashdb2's data and log directories again. I've also disabled the zabbix-agent service, just to eliminate that as well.
|
|
Hi amit.gupta@opensense.com ,
Sorry for the extended back-and-forth. There appears to be some issue with the previous upload-portal (we periodically need to expire these for security/privacy). I've generated a new one, which you should be able to access via this link. Would you mind uploading the data from the most recent incidents again (the two crashes on appcluster0), with distinct names?
Thanks very much for your patience.
|
|
I'm currently uploading a file called ashdb2-second-crash-rocky8-mongo-6.0.6.tar. It has the coredump in it as well.
|
|
The data should be uploaded. I actually reset ashdb2 and enabled coredumps.. To reset it, I deleted all the files in /var/lib/mongo and /var/log/mongodb/ and restarted the node. It synced to appcluster0 about 20 minutes ago. I just got a notification that mongod crashed again. It's not exactly the same error about the checksum, but it is an error about reading collection-18 again. (We have had the same error before though). I will upload all the data for that as well as the coredump.
|
|
I used a link from above (but may have got the wrong one). I'm uploading to this link you just provided and I will bookmark it.
|
|
amit.gupta@opensense.com Where did you upload the data? I can't see them in the portal we were previously using. Can you try using this link? https://amphora.corp.mongodb.com/public/upload/eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJmb2xkZXJfaWQiOiIyMDYyMjkzMDk4MTgiLCJleHAiOjE2ODcwMjk3MDF9.z6PbRb8vaH-PwbLLt646SxYNctwo6_yBJSJQjdVlISw
|
|
I just uploaded it again.. It is called ashdb2-rockylinux8-mongod6.0.6-crash.tar.gz
I just checked the mongod.log on ashdb2 that just crashed, and it shows FCV 6:
{"t":{"$date":"2023-05-16T14:45:56.746-04:00"},"s":"I", "c":"REPL", "id":5853300, "ctx":"initandlisten","msg":"current featureCompatibilityVersion value","attr":{"featureCompatibilityVersion":"unset","context":"startup"}}
|
{"t":{"$date":"2023-05-16T14:46:38.753-04:00"},"s":"I", "c":"REPL", "id":5853300, "ctx":"ReplWriterWorker-0","msg":"current featureCompatibilityVersion value","attr":{"featureCompatibilityVersion":"6.0","context":"setFCV"}}
|
|
|
Also, I can't seem to see any uploaded data. Where did you upload it, and with what name?
|
|
Sure, I can give an example of a log-line that indicates FCV:
In the log output for the crash of pndb3 when it was part of mongocluster4 on 5/15, in the 4th log-line of the mongod logs provided, we see:
{"t":{"$date":"2023-05-15T00:00:01.834-04:00"},"s":"I", "c":"REPL", "id":5853300, "ctx":"SignalHandler","msg":"current featureCompatibilityVersion value","attr":{"featureCompatibilityVersion":"5.0","context":"log rotation"}}
|
|
logged.
|
|
Hi, data has been uploaded.
Sorry for the confusion. Here is what we did:
- created 3 node replica set cluster called appcluster0. All nodes rocky8.
- Yesterday I had done a test export/import of one of our databases using mongodump and mongorestore. It was imported into appcluster0 fine. I'm curious if you see any FCV5 stuff on this cluster.
- Since yesterday the replicaset has just been sitting idle with no apps or anything connected to it.
- I was planning on using this replicaset in production next week after doing another export/import and then changing our connection strings in our app.
That is interesting that FCV5 was on mongocluster4. I'm trying to remember clearly, but I'm pretty sure no data was even seeded into that cluster. Could you give me an example log-line that indicates FCV 5? I'm wondering if we have any data corruption in our config servers (although mongocluster4 was not even in the sharded cluster at this time so the config servers wouldn't affect it).
In any case, we should not have FCV5 anywhere.. I'm wondering if we had any corruption somewhere in the cluster when we upgraded. We had a false-start upgrade that we had to back out of last year because the upgrade didn't go smoothly. Not sure if there is any remnant of that. If it's easier to discuss on the phone, I can send you my contact info. I'm free whenever.
|
|
amit.gupta@opensense.com sorry to hear about the latest crash - let me know when you have uploaded the data.
I'm wondering, though, if you could tell me in detail what the state of the cluster was when the event happened. You mentioned that it 'hasn't yet been migrated to', which made me think the cluster contained no data. However, the fact that the data files may contain sensitive information made me wonder if there is indeed some data on the cluster. How was the cluster seeded with data?
I was also wondering about this because in the previous set of incidents (the crashes on 'mongocluster4') you mentioned that there was no data on the cluster. However, my colleague noticed during log-inspection that the FCV version of the cluster was 5.0 - this should only be the case on 6.0+ binary-version nodes if they were started-up with existing data files or if the FCV was set manually. Did the cluster have any datafiles when it was seeded, or was the FCV set manually?
Any information about the state of the datafiles when the server was started up + throughout its lifetime would be very helpful for us moving forward.
|
|
I can't seem to find the coredump on this newly built server. I will need to update some configs to capture it. So I will just upload the logs and the contents of /var/lib/mongo
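For my own notes, the pieces needed are probably something like this (a sketch; the exact unit name and core location may differ on our build):
# Allow the systemd-managed mongod to write core files
sudo systemctl edit mongod    # add:  [Service]  LimitCORE=infinity
# Pick a predictable core location (%e = executable name, %p = pid)
sudo sysctl -w kernel.core_pattern=/var/tmp/core.%e.%p
sudo systemctl restart mongod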
|
|
we created a 3-node cluster with rocky linux 8 (called appcluster0). We were planning on migrating one of our databases to it. There is virtually no activity on it since it hasn't been migrated to. one of the servers in the cluster just crashed with basically the same error on collection-18 (though I'm not sure if it's the oplog). I'm going to upload the data, although I can't upload all the datafiles since there may be some sensitive data in some of the collections.
{"t":
{"$date":"2023-05-18T13:46:09.064-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":0,"message":
{"ts_sec":1684431969,"ts_usec":64880,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:226:collection-18-297353196327239210.wt: potential hardware corruption, read checksum error for 4096B block at offset 4364562432: block header checksum of 0 doesn't match expected checksum of 0xd81caefc"}
}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1684431969,"ts_usec":64987,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_bm_corrupt_dump:77:
{0: 4364562432, 4096, 0xd81caefc}
: (chunk 1 of 4): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "}}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1684431969,"ts_usec":65089,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_bm_corrupt_dump:77:
{0: 4364562432, 4096, 0xd81caefc}
: (chunk 2 of 4): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "}}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1684431969,"ts_usec":65187,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_bm_corrupt_dump:77:
{0: 4364562432, 4096, 0xd81caefc}
: (chunk 3 of 4): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "}}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":0,"message":{"ts_sec":1684431969,"ts_usec":65282,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_bm_corrupt_dump:77:
{0: 4364562432, 4096, 0xd81caefc}
: (chunk 4 of 4): 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 "}}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":
{"ts_sec":1684431969,"ts_usec":65321,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:collection-18-297353196327239210.wt: fatal read error","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}
}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31804,"message":
{"ts_sec":1684431969,"ts_usec":65334,"thread":"5947:0x7f96b4112700","session_dhandle_name":"file:collection-18-297353196327239210.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804}
}}
{"t":
{"$date":"2023-05-18T13:46:09.065-04:00"}
,"s":"F", "c":"ASSERT", "id":23089, "ctx":"Checkpointer","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":712}}
|
|
I guess it's possible at this point, but I just can't see which process would be truncating the mongo data files or who would have access.. Also, I wouldn't have a script/process that would specifically target the Oplog either.
Also, it's not clear why servers that were otherwise working fine for months in mongocluster2 (vindb2 & ashdb2) would crash after it was put into mongocluster3. All we did was delete /var/lib/mongo/* and change /etc/mongod.conf's clustername.. Those were literally the only 2 changes made to the server, and then it started to crash after a couple days. If I didn't know better, I would say mongocluster3 is cursed.
|
|
Hey amit.gupta@opensense.com - I don't think it's possible for the type of corruption we're seeing to originate from replication. When replica-set secondaries fetch Oplog entries from other nodes, they expect to read valid BSON objects; in the Oplog datafiles that you've shared, the file isn't a structurally valid WiredTiger file. Had the file appeared this way while the process was running, it wouldn't have been possible to transmit valid BSON to the secondaries.
I do think the correlation-in-time of the crashes is interesting. Do you think there's a chance any reporting or other process/script runs at around the same time on different machines? Based on the data-files, one theory we have is that another process is truncating the file (writing a single linefeed character into it) while MongoDB is running. If MongoDB didn't currently have the file open, we'd then see the single-byte datafile we saw in the first crash for which you provided datafiles, along with the failure we saw. If MongoDB did have the file open while the truncation happened, then we'd see a mixture of a linefeed at the beginning of the file, many zero-bytes (written by the OS over all the data past the truncation point), and some blocks of valid data (written by WiredTiger concurrently with the truncation, at specific offsets in the file). This is exactly what the data files look like for the most recent two crashes that we have full data available from.
We're staying tuned to hear whether the move to RockyLinux and associated stopping of other processes running on the machines helps.
|
|
Thanks! I've checked the permissions of the zabbix agent and it's running under a zabbix user. The /var/lib/mongo directory is owned by mongod:mongod and no other users can write to it. If we get the same crash, I will definitely turn off zabbix to eliminate it as a possible cause.
Out of curiosity, if a replica-set member gets a corrupted Oplog entry (for whatever reason), is it possible for that entry to get replicated to other nodes? I'm trying to wrap my head around how the crashes seemed to spread once a node crashed. I.e., is it possible that the "\n" character got sent over the wire to another node? Or is it more likely the other nodes independently corrupted the file themselves?
|
|
Thanks for the update amit.gupta@opensense.com , and I totally understand the focus on stability. If possible, I would recommend not running the Zabbix agent - I know it shouldn't interact with the MongoDB files but it would just help us eliminate another potential variable. If it's needed, I'd also recommend setting up auditd as mentioned above (unfortunately I don't have more specific instructions for Rocky Linux) and by using groups/permissions to ensure that the only process that can write to the data directory is mongod. I hope this helps.
|
|
Quick update. Last night we rebuilt/resynced all of mongocluster3 with a 5-node replicaset that is entirely on RockyLinux8 (installed manually). The only other piece of software on the servers is our zabbix monitoring agent which has been curtailed to just do basic system monitoring. It should have no interaction with MongoDB files at all.
Sorry for the haphazard jump to this new OS, but I needed to focus on getting stability in the cluster. If mongod crashes on these Rocky8 systems, I think we will have a lot more clues.
|
|
Thanks George,
I'm rebuilding the servers in mongocluster3 with RockyLinux8. They will be barebones with no other software running. No logdna or zabbix etc.
I'll research how to make auditd watch the mongodb files. Thanks for that tip! If you have any guidance on this, that would be awesome. I'm just scrambling to get our cluster back at the moment.
|
|
Hi amit.gupta@opensense.com ,
Sorry to hear you are still having issues. I've taken time today to look through the data you uploaded for the crashes on daldb3 and pndb3 on 6.0.6. Here's what I've found so far:
- In each case, the crash is due to a WiredTiger panic when trying to access the file stored at collection-18-XXXXXXXXX.wt
- The catalog files you provided (Thank You!) have allowed us to confirm that in each case, this collection is the Oplog.
- In each case, WiredTiger panics because the checksum of the block it is trying to read from the file is 0 instead of the expected true-value
- Our diagnostic tooling is unable to interpret the files as valid WiredTiger data files
- I spent some time inspecting the binary data/bytes in the files manually; here's what I found:
- In each case, the file begins with a newline character followed by null bytes. The null bytes fill an entire 'block' - the 4096-byte unit WiredTiger uses for I/O with the filesystem. There is no valid header data
- Many of the blocks are entirely or almost entirely null-bytes. Counting the blocks that are entirely null bytes (possibly including 1 newline or other char), we see 118 of 237 and 119 of 230 blocks in the two crashes respectively.
- The other blocks appear to contain some data but it is unclear what. We'll continue investigating this data.
Note the presence of the newline-byte at the very beginning - in previous instances of this failure, where the data files are preserved, we saw that the data-file was one-byte in size and contained only a newline character, so this is a distinctive commonality (we would never expect WiredTiger/MongoDB to write this byte).
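If you want to see this pattern yourself, it's visible right at the start of the file (a sketch; substitute the actual datafile path):
# A valid WiredTiger datafile begins with a 4KB file header; the affected files
# instead begin with a single 0x0a (newline) byte followed by zero-filled blocks
hexdump -C /var/lib/mongo/collection-18-*.wt | head -5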
I'm going to continue looking over the data and discussing it with my colleagues, but unfortunately the storage team is not optimistic that we can recover much more information from the data files.
Lastly: is there any way you would be able to stop all other processes on this replica set (i.e. logdna-agent; any cron jobs or management scripts run by you) for a while, to see if this stops the issue from reoccurring? Additionally, you may be able to use "auditd" to identify what processes, if any, are accessing the files. This would be hugely helpful, because we could then isolate the issue to WiredTiger/MongoDB's management of the data files.
|
|
and another coredump on ashdb2 (which is a new server in mongocluster3). So we lost majority on mongocluster3.. these are all new servers in this cluster..
|
|
just got another crash in our production node in mongocluster3 (vindb2). Now I'm totally at a loss on how this could happen. This is a brand new node in mongocluster3. It has never had a problem in the past while it was part of mongocluster2.
|
|
I've uploaded logs and the full /var/lib/mongo directory for daldb3 and pndb3 (they are running 6.0.6). These nodes are in the mongocluster4 replicaset (which has no data in it, nor has it been added to our sharded cluster). These servers are in different data centers, with different bare metal vendors. hildb3 (which crashed on Friday) is actually in a third datacenter with a third provider.
|
|
Quick updates. We found some interesting things. I have been working with consulting services, and we were planning on creating a mongocluster4 replicaset and then migrating all chunks from mongocluster3 to mongocluster4. We took some of the servers in mongocluster3 to create the new cluster. Mongocluster4 is barebones and has no data. Before we added mongocluster4 to our sharded cluster, one of the servers (hildb3) crashed after about 15 minutes on Friday. Today I had 2 more servers crash in mongocluster4 (daldb3, then pndb3) in short succession. There is absolutely no data inserted into that database by us. Our current mongocluster3 shard is made up of a 3-node replicaset composed of totally new/different servers.
I did a full hardware test on hildb3 on saturday but everything came back clean. I also had removed hildb3 from mongocluster4 since I have a suspicion that another server in the cluster may also have the issue. Since daldb3 and pndb3 are both in crashed states right now, and they are not part of our production environment, I'm happy to gather all the data for you. We can instrument this cluster as much as you want.
Is there any possibility that a corrupted oplog or some other replicaset communication can "poison" another node in a replica set?
|
|
amit.gupta@opensense.com you should be able to use the "ls -i" command to find the inode of each of the WiredTiger datafiles. But I wanted to warn again that this may produce a large amount of output if the servers stay up for a long time, so it may not be the best choice if the problem isn't reproducing regularly/reliably and the output grows too extensive, or if you are not able to trim the output regularly.
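As a sketch (assuming the default /var/lib/mongo data path):
# ls -i prints the inode in decimal, while the perf trace shows it in hex,
# so convert before filtering the trace output
ino=$(ls -i /var/lib/mongo/collection-18-*.wt | awk '{print $1}')
perf script | grep -i "ino $(printf '0x%x' "$ino")"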
|
|
cool, I'll run it in a tmux session. But the output doesn't seem to show the filename so I'm not sure what we would be grepping for? It seems like it just has inode info.
|
|
Hi amit.gupta@opensense.com,
Yes, you need to have perf record running when the crash happens - you should be able to run it without too much overhead, depending on how often that tracepoint is hit. Unfortunately, I don't believe there's any way to teach the `perf` tool to only count trace-points for a particular file; you'll need to do that filtering afterward on the output file, using a tool such as grep.
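If you want to leave it running unattended until the next crash, something like this should work (the output path is just an example):
nohup perf record -e 'xfs:xfs_itruncate_extents_start' -o /root/xfs-truncate.data &
# After a crash, stop perf with SIGINT so it finalizes the file, then inspect it
kill -INT %1
perf script -i /root/xfs-truncate.data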
|
|
Regarding perf record, I just tried running it in a window.. it seems like when I run it in a terminal, it keeps sampling until I hit ^C:
|
perf record -e 'xfs:xfs_itruncate_extents_start'
|
^C[ perf record: Woken up 1 times to write data ]
|
[ perf record: Captured and wrote 0.585 MB perf.data (24 samples) ]
|
The output of "perf script" during this time is this: (the samples were taken over about 30 seconds).
perf script
|
kworker/u32:0-i 1297032 [005] 511788.806340: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3074 disize 0x1cd325 new_size 0x1cd325
|
ftdc 1187716 [002] 511797.562434: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x6c31
|
kworker/2:2-eve 1292113 [002] 511798.559400: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x0
|
ftdc 1187716 [002] 511807.563272: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x758b
|
kworker/2:2-eve 1292113 [002] 511808.559302: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x0
|
ftdc 1187716 [002] 511817.562918: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x7f3c
|
kworker/2:2-eve 1292113 [002] 511818.559202: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x0
|
ftdc 1187716 [002] 511827.563508: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x87fa
|
kworker/2:2-eve 1292113 [002] 511828.556773: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x0
|
xauth 1303217 [011] 511829.268470: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1dd disize 0x0 new_size 0x1c2
|
kworker/11:2-md 1190303 [011] 511829.556012: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1dc disize 0x0 new_size 0x0
|
perf 1303689 [003] 511836.855789: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x20000216 disize 0x43f48 new_size 0x43f48
|
pager 1303693 [002] 511836.856667: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1da disize 0x0 new_size 0x14
|
kworker/2:2-eve 1292113 [002] 511837.559262: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1c4 disize 0x0 new_size 0x0
|
ftdc 1187716 [010] 511837.563496: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x92f5
|
kworker/10:0-mm 1301364 [010] 511838.559331: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x0
|
Checkpointer 1187714 [014] 511845.494189: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x202f8943 disize 0x7000 new_size 0x7000
|
Checkpointer 1187714 [014] 511845.502378: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x201df1a8 disize 0x9000 new_size 0x9000
|
Checkpointer 1187714 [014] 511845.507476: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x201defef disize 0x9b000 new_size 0x9b000
|
kworker/14:0-ev 1262439 [014] 511846.559169: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x203635bc disize 0x0 new_size 0x0
|
ftdc 1187716 [010] 511847.563349: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x401e3062 disize 0x0 new_size 0x9ed1
|
kworker/10:0-mm 1301364 [010] 511848.559087: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x40258480 disize 0x0 new_size 0x0
|
pager 1303704 [002] 511849.133798: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1c4 disize 0x0 new_size 0x14
|
kworker/2:2-eve 1292113 [002] 511849.559429: xfs:xfs_itruncate_extents_start: dev 259:7 ino 0x2001e1da disize 0x0 new_size 0x0
|
Should I be running the perf record in the background? Also, it's not clear to me how to associate this perf data with the specific mongo wiredtiger files.
|
|
Thanks George! Quick notes
- I've added those WiredTiger files to our information gathering script.
- I'll spend some time today researching the xfs perf related items you discussed. Thank you for this tip! We do not change the default configs for XFS as far as I know.
- The logdna-agent monitors log files. While there are a handful of other processes on the server (like postfix for sending local emails), none of them should interfere with mongo files. I thought you were asking about other processes such as virus scanners and the like. We don't run those on these nodes.
- In the past, when we made backups of the /var/lib/mongo directories, it was only after we stopped mongo. Never while running. Going forward we will only sync via adding a blank node to the replicaset.
Yesterday I saw 2 servers in the cluster, daldb3 and vindb3, which had oplog windows of about 70 days. This was strange to me because I could have sworn that I had resynced those nodes from the network within the last couple of weeks. (Whenever I say "from the network", I'm referring to the default way Mongo syncs a new node in via rs.add().) In any case, I've resynced them completely by deleting their data directories and adding them to the cluster. I'm optimistic that maybe clearing out all these oplog collections may do the trick if there was a potential corruption in there.
|
|
Hi amit.gupta@opensense.com,
I've spent some time taking a look at the additional information and discussing it with my colleagues and wanted to share some information.
First, I investigated the single-byte WiredTiger file - recall that the crashes are caused because WiredTiger cannot read this file. I confirmed with the storage engines team that WiredTiger/MongoDB should never leave a file in this state; even an "empty" WiredTiger file should have at least a file header of 4KB. This would be a very strange bug if MongoDB is leaving the file in this state.
I then checked with members of the support team; they have seen this error before in several cases, and in each past instance it was because the storage system or some other 'helpful' process on the machine had truncated the file while MongoDB was running. I checked the `ps` output you provided, and I see that MongoDB is not the only process running on this machine - in fact, in the minute before the crash, I see a process called logdna-agent running and reading files in the /var directory, and many other processes are also active. Are you certain none of these programs are interfering with the mongodb data-files directory?
Additionally, it's possible the storage system itself (the operating system or filesystem and anything below it) is truncating the file due to some erroneous configuration. Since you are using XFS, you should be able to use the linux `perf` tool to discover the origin of the truncation if this is indeed the case, like so:
sudo perf record -e 'xfs:xfs_itruncate_extents_start'
|
and later running perf script to process the output; here's an example of the output:
$ sudo perf script
|
truncate 23845 [004] 9760661.627601: xfs:xfs_itruncate_extents_start: dev 259:0 ino 0x4c9aff size 0x1 new_size 0x1
|
Thinking of other ways the data-files could have been corrupted or truncated, I recalled that you mentioned you were sometimes restoring nodes from local copies/backups of the datafiles. I wanted to clarify and emphasize - making local copies of the datafiles while MongoDB is running is not a supported backup method and could cause corruption of the data files, causing issues like the one you are seeing. If you are doing any manual manipulation/backup/copying of the datafiles while the mongod process is running, I would advise you to stop, as this could be causing corruption and forcing WiredTiger to panic the mongod process.
Lastly, I wanted to provide some information about what information we would need to continue the investigation, if the above doesn't resolve the issue. In the future, please continue to provide full core-dumps and all of the additional information you provided in the last full-upload of a crash (I didn't see any data for the most recent two crashes). Similarly, please also include the following files from the data directory at the time of the crash:
WiredTiger
WiredTiger.lock
WiredTiger.turtle
WiredTiger.wt
WiredTigerHS.wt
_mdb_catalog.wt
<whatever-collection-is-causing-the-panic-read>.wt
|
These files, along with the additional information, would allow us to confirm the identity of the collection backed by the affected WiredTiger file (i.e. to confirm it is the OpLog as you suspect, or something else), and would allow me to pass this to the storage team to effectively look into the source of the potential issue if it is in MongoDB. If you confirm it is indeed the OpLog, I think running:
validate({full: true})
on each crashed node after the crash is a good idea and may help us find out more.
However, I do want to emphasize that in past cases where we have seen this particular issue, it was because of outside interference (i.e. the storage system or another process truncating the datafile), and that copying or otherwise modifying the datafiles while mongod is running is not supported and could cause datafile corruption as well.
I hope this helps and we can resolve this issue soon.
|
|
Note for future:
I ran an operation to temporarily reduce the oplog size, compact it, then increase it on all the secondary nodes. Prior to running the commands, 2 nodes had Oplog Windows of 60+ days (daldb3 and vindb3).
use local
db.oplog.rs.stats().maxSize
// default oplog size was 46404326656 (bytes)

// temporarily shrink the oplog (replSetResizeOplog takes a size in MB), then compact it
db.adminCommand({replSetResizeOplog: 1, size: Double(1000)})
db.runCommand({ "compact" : "oplog.rs" })

// restore oplog
db.adminCommand({replSetResizeOplog: 1, size: Double(20000)})
|
This is just a stab in the dark in case something in the oplog files got corrupted during our previous issues when we were running 5.x and 6.0.4.
|
|
I had a thought this morning. I remember trying to figure out which collection the WiredTiger file that is getting corrupted refers to. If I recall correctly, the corrupted collection is the oplog collection and not our large activity collection. Do you think we should be running validate({full: true}) on the oplog and/or other local collections as well?
|
|
We had a couple days of peace.. But this morning we had crashes on 2 different servers. Same exact backtraces as before.
pndb3 at 7:36am EST
daldb3 at 8:08 EST
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
|
--Type <RET> for more, q to quit, c to continue without paging--
|
Core was generated by `/usr/bin/mongod --config /etc/mongod.conf'.
|
Program terminated with signal SIGABRT, Aborted.
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140268766950976) at ./nptl/pthread_kill.c:44
|
44 ./nptl/pthread_kill.c: No such file or directory.
|
[Current thread is 1 (Thread 0x7f92de06f640 (LWP 338400))]
|
(gdb) bt
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140268766950976) at ./nptl/pthread_kill.c:44
|
#1 __pthread_kill_internal (signo=6, threadid=140268766950976) at ./nptl/pthread_kill.c:78
|
#2 __GI___pthread_kill (threadid=140268766950976, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
|
#3 0x00007f92e72d2476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
|
#4 0x00007f92e72b87f3 in __GI_abort () at ./stdlib/abort.c:79
|
#5 0x0000562ac01abd62 in mongo::fassertFailedWithLocation(int, char const*, unsigned int) ()
|
#6 0x0000562abfd5c990 in mongo::(anonymous namespace)::mdb_handle_error_with_startup_suppression(__wt_event_handler*, __wt_session*, int, char const*) [clone .cold.1418] ()
|
#7 0x0000562ac0d20c81 in __eventv ()
|
#8 0x0000562abfd70bc7 in __wt_panic_func ()
|
#9 0x0000562abfd8183a in __wt_block_checkpoint_resolve.cold.7 ()
|
#10 0x0000562ac0ce4472 in __wt_meta_track_off ()
|
#11 0x0000562ac0d40af5 in __txn_checkpoint_wrapper ()
|
#12 0x0000562ac0d419b6 in __wt_txn_checkpoint ()
|
#13 0x0000562abfd6fd72 in __session_checkpoint.cold.48 ()
|
#14 0x0000562ac0c46d74 in mongo::WiredTigerKVEngine::_checkpoint(__wt_session*) ()
|
#15 0x0000562ac0c49230 in mongo::WiredTigerKVEngine::checkpoint() ()
|
#16 0x0000562ac148cd74 in mongo::Checkpointer::run() ()
|
#17 0x0000562ac2e9ab16 in mongo::BackgroundJob::jobBody() ()
|
#18 0x0000562ac2e9b19c in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::{lambda()#2}, , 0>(mongo::BackgroundJob::go()::{lambda()#2})::{lambda()#1}> > >::_M_run() ()
|
#19 0x0000562ac325cc5f in execute_native_thread_routine ()
|
#20 0x00007f92e7324b43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
|
#21 0x00007f92e73b6a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
|
We are thinking of doing an update to all documents in the database to remove a field, then running a rolling compact() on the db. Will that process also validate the entire collection?
Also, based on these logs, do you think we should increase the debug logging for WiredTiger?
|
|
hildb3 crashed again approximately 2 hours after these checks were completed. The error seems the same or similar. I'm uploading the coredump and mongod.log. Unfortunately my colleague didn't collect anything else because it was the middle of the night.
I am in the process of replacing hildb3 with a brand new server, tamdb3. I want to remove hildb3 from the equation for a while since it seems to be the only one crashing in the last 2 instances. I do not see any indication of hardware errors in the logs, but who knows.
|
|
Just a quick note for future reference. We ran the validate({full: true}) on all 5 nodes after today's crash on hildb3. All came back without errors or warnings.
|
|
Thanks george.wangensteen@mongodb.com. I did not get a chance to complete the replicaSet maintenance operations on 3 of the 5 nodes; I ran it on daldb3 and vindb3. Coincidentally, neither of those 2 nodes crashed today! Only hildb3 crashed today. I'm going to run the repair on the rest of the nodes tonight (hildb3, ashdb3, pndb3).
Is it possible that the "validate({full: true})" may have fixed something even though it didn't report any errors or warnings?
I just checked the /var/lib/mongo.crashed/collection-18-3750969587401846988.wt file and it is 1 byte. Here is the hexdump of it:
# hexdump collection-18-3750969587401846988.wt
|
0000000 000a
|
0000001
|
However, the working copy currently at /var/lib/mongo/collection-18-3750969587401846988.wt is 1.3GB. Do you want me to upload the current working copy?
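(A quick way to compare the two copies side by side - a hedged sketch, assuming GNU stat:)
stat -c '%n: %s bytes, inode %i' /var/lib/mongo.crashed/collection-18-3750969587401846988.wt /var/lib/mongo/collection-18-3750969587401846988.wt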
|
|
I took an initial look at the core dump and associated data and wanted to leave some notes:
- The server is failing due to a WiredTiger panic that originates with this error:
"ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1683314632,"ts_usec":10704,"thread":"346695:0x7f007bfec640","session_dhandle_name":"file:collection-18-3750969587401846988.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"int __posix_file_read(WT_FILE_HANDLE *, WT_SESSION *, wt_off_t, size_t, void *):435:/var/lib/mongo/collection-18-3750969587401846988.wt: handle-read: pread: failed to read 4096 bytes at offset 24027136","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
|
|
- This is, as far as I can tell, the same WiredTiger error experienced in the past, including the same number of bytes read, and it occurs in the same context (as part of the checkpointer thread attempting to take a checkpoint).
amit.gupta@opensense.com, were you able to perform the full replica-set maintenance process that I recommended above, in the section of my comment about WiredTiger errors? Additionally, were you able to run validate with full: true on every node, as you mentioned you started last night? As mentioned above, this error is indicative of data file corruption, possibly originating with hardware, and is very unlikely to go away as long as the same data files continue being used. To remediate this issue, I would refer back to the steps I recommended above: completing the replica set maintenance/validate process, and doing a full initial sync of this node over the network, ideally from a node that has passed validation. Please do not continue to recover from local copies of the datafiles as you mentioned you had tried above - the issue will persist if the data files are indeed corrupted.
To try and confirm the corruption, could you please upload the data-file in question:
/var/lib/mongo/collection-18-3750969587401846988.wt
|
If that's possible, I can ask the Storage Engines/WiredTiger team to take a look and see if they can find anything.
I hope this helps.
I'm going to jot down a few other notes from my investigation, in case they're useful in the future:
|
|
george.wangensteen@mongodb.com the coredump/diagnostics upload is complete. I also have a backup of the /var/lib/mongo directory on the server which contains the corrupted files. If there are any files in there that you need me to upload, please let me know.
|
|
Got it. Yes, I think our current issues are related to this SIGABRT. The other segfault and WiredTiger issues may be either from previous issues in mongos, or subsequent issues that arise after the SIGABRT. Just a hunch. Interesting that only 1 node has crashed so far; I'm expecting more based on previous experience.
The core dump and related logs are uploading now. It's about 18GB.
Here is the error I found in syslog in case it's helpful. Apologies for the formatting:
May 5 15:24:14 hildb3 systemd-coredump[733507]: Process 346695 (mongod) of user 120 dumped core.

Found module linux-vdso.so.1 with build-id: c6f44612739b13c60cb58d238c694963cc3f7251
Found module libffi.so.8 with build-id: 59c2a6b204f74f358ca7711d2dfd349d88711f6a
Found module libkeyutils.so.1 with build-id: ff27227afa5eeddccab180dd29bd7fcff94aea7c
Found module libtasn1.so.6 with build-id: 2fde6ecb43c586fe4077118f771077aa1298e7ea
Found module libp11-kit.so.0 with build-id: a0ffe1d002de5812dc718186172efb78604ddf2c
Found module libbrotlicommon.so.1 with build-id: 43a72967cf84155914c8b3e915926733d1e57c11
Found module libsasl2.so.2 with build-id: b8a924f277180f8743a0c6463a379b1a4ad1aae5
Found module libkrb5support.so.0 with build-id: 60c3bf331f85f4be0c25e701818c2cd36b1b8b9c
Found module libcom_err.so.2 with build-id: ce0901f10854b3c9276066b98d9a72303206e0d5
Found module libk5crypto.so.3 with build-id: 48fa2634690fb84d81748073f60fe2759fc09ba8
Found module libkrb5.so.3 with build-id: ae22aca5575bb9ec63eb47a7c2d04ad9802d92d4
Found module libgmp.so.10 with build-id: f110719303ddbea25a5e89ff730fec520eed67b0
Found module libnettle.so.8 with build-id: 3d9c6bf106ef53d625b7b1c8bb1300e84598a74a
Found module libhogweed.so.6 with build-id: 3cc4a3474de72db89e9dcc93bfb95fe377f48c37
Found module libgnutls.so.30 with build-id: ededb83d3498ef77e1f7011b79f025efd7eb498f
Found module libunistring.so.2 with build-id: ca5149da8d5a298b8f286ffca3d6e2402ec0fe01
Found module libz.so.1 with build-id: 30840b79ac329ecbf1dec0bb60180eed256d319f
Found module libbrotlidec.so.1 with build-id: 4b1f390dd6e24d49684db8b2443d082379e8e977
Found module libzstd.so.1 with build-id: 5d9d0d946a3154a748e87e17af9d14764519237b
Found module liblber-2.5.so.0 with build-id: 975714e9cbc0f4c3429d01b2f0498dc269e971e3
Found module libldap-2.5.so.0 with build-id: 6ccf1d949a0b80bd01102fbff7e23f05c6696cb8
Found module libgssapi_krb5.so.2 with build-id: 1b957c0ba6d4ac8e5db64652bbcf553c3ebad23b
Found module libpsl.so.5 with build-id: 2b1afc1a3bc8bdb016e432c50db058632e7895b9
Found module libssh.so.4 with build-id: f6a2eb7d2d6c5a8e3b7a9b2dcb3fa9ab394cf2b5
Found module librtmp.so.1 with build-id: 9517ef375cd71ea3da824b4118f1599735093d66
Found module libidn2.so.0 with build-id: 45b73e0e1c46a76be22f572ee98c60af5768bf8f
Found module libnghttp2.so.14 with build-id: 90a67111383c58bfff9fac96da818cc62e5b68c9
Found module ld-linux-x86-64.so.2 with build-id: 61ef896a699bb1c2e4e231642b2e1688b2f1a61e
Found module libc.so.6 with build-id: 69389d485a9793dbe873f0ea2c93e02efaa9aa3d
Found module libgcc_s.so.1 with build-id: 09c4935b79388431a1248f6a98e00d7dc81b8513
Found module libssl.so.3 with build-id: 4bc97c5bb581ccfe9ae803981f527ce321f16c7a
Found module libcrypto.so.3 with build-id: 62ba5ee88d663a2396160fed1a1864f1f3b60103
Found module libresolv.so.2 with build-id: 7fd7253c61aa6fce2b7e13851c15afa14a5ab160
Found module libm.so.6 with build-id: 27e82301dba6c3f644404d504e1bb1c97894b433
Found module liblzma.so.5 with build-id: b85da6c48eb60a646615392559483b93617ef265
Found module libcurl.so.4 with build-id: 107825dc146cd881c29e08cff116713201458095
Found module mongod with build-id: 0b953ed8d26ffcc5
Stack trace of thread 631815:
#0 0x00007f008ea19a7c pthread_kill (libc.so.6 + 0x96a7c)
#1 0x00007f008e9c5476 raise (libc.so.6 + 0x42476)
#2 0x00007f008e9ab7f3 abort (libc.so.6 + 0x287f3)
#3 0x0000556a233d7bcc _ZN5mongo25fassertFailedWithLocationEiPKcj (mongod + 0x1186cbcc)
#4 0x0000556a19af796b _ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc (mongod + 0x7f8c96b)
#5 0x0000556a19fe80cc __eventv (mongod + 0x847d0cc)
#6 0x0000556a19fe9c80 __wt_panic_func (mongod + 0x847ec80)
#7 0x0000556a19b81fa9 __wt_block_checkpoint_resolve (mongod + 0x8016fa9)
#8 0x0000556a19e026ea __wt_meta_track_off (mongod + 0x82976ea)
#9 0x0000556a1a0bba61 __txn_checkpoint_wrapper (mongod + 0x8550a61)
#10 0x0000556a1a0b9fe9 __wt_txn_checkpoint (mongod + 0x854efe9)
#11 0x0000556a19f76b2d __session_checkpoint (mongod + 0x840bb2d)
#12 0x0000556a199db4cf _ZN5mongo18WiredTigerKVEngine11_checkpointEP12__wt_session (mongod + 0x7e704cf)
#13 0x0000556a199fbcb9 _ZN5mongo18WiredTigerKVEngine10checkpointEv (mongod + 0x7e90cb9)
#14 0x0000556a1bcbdc9c _ZN5mongo12Checkpointer3runEv (mongod + 0xa152c9c)
#15 0x0000556a22ce9cb0 _ZN5mongo13BackgroundJob7jobBodyEv (mongod + 0x1117ecb0)
#16 0x0000556a22cefd12 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC1IZNS3_13BackgroundJob2goEvE3$_0JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv (mongod + 0x11184d12)
#17 0x0000556a237e517f execute_native_thread_routine (mongod + 0x11c7a17f)
#18 0x00007f008ea17b43 n/a (libc.so.6 + 0x94b43)
#19 0x00007f008eaa9a00 n/a (libc.so.6 + 0x126a00)

Stack trace of thread 346700:
#0 0x00007f008ea14197 n/a (libc.so.6 + 0x91197)
#1 0x00007f008ea16f1b pthread_cond_timedwait (libc.so.6 + 0x93f1b)
#2 0x0000556a17246fca _ZNSt3_V222condition_variable_any10wait_untilIN5mongo20BasicLockableAdapterENSt6chrono3_V212system_clockENS4_8durationIlSt5ratioILl1ELl1000000000EEEEEESt9cv_statusRT_RKNS4_10time_pointIT0_T1_EE (mongod + 0x56dbfca)
#3 0x0000556a22e87a88 _ZN5mongo11ClockSource21waitForConditionUntilERNS_4stdx18condition_variableENS_20BasicLockableAdapterENS_6Date_tEPNS_8WaitableE (mongod + 0x1131ca88)
#4 0x0000556a1907f2d4 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC1IZNS3_18PeriodicRunnerImpl15PeriodicJobImpl4_runEvE3$_0JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv (mongod + 0x75142d4)
#5 0x0000556a237e517f execute_native_thread_routine (mongod + 0x11c7a17f)
#6 0x00007f008ea17b43 n/a (libc.so.6 + 0x94b43)
#7 0x00007f008eaa9a00 n/a (libc.so.6 + 0x126a00)

Stack trace of thread 346735:
#0 0x00007f008ea14197 n/a (libc.so.6 + 0x91197)
#1 0x00007f008ea16f1b pthread_cond_timedwait (libc.so.6 + 0x93f1b)
#2 0x0000556a172f3de3 _ZNSt3_V222condition_variable_any10wait_untilISt11unique_lockIN5mongo12latch_detail5LatchEENSt6chrono3_V212system_clockENS7_8durationIlSt5ratioILl1ELl1000000000EEEEEESt9cv_statusRT_RKNS7_10time_pointIT0_T1_EE (mongod + 0x5788de3)
#3 0x0000556a19a1b52e _ZN5mongo18WiredTigerKVEngine24WiredTigerSessionSweeper3runEv (mongod + 0x7eb052e)
#4 0x0000556a22ce9cb0 _ZN5mongo13BackgroundJob7jobBodyEv (mongod + 0x1117ecb0)
#5 0x0000556a22cefd12 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC1IZNS3_13BackgroundJob2goEvE3$_0JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv (mongod + 0x11184d12)
#6 0x0000556a237e517f execute_native_thread_routine (mongod + 0x11c7a17f)
#7 0x00007f008ea17b43 n/a (libc.so.6 + 0x94b43)
#8 0x00007f008eaa9a00 n/a (libc.so.6 + 0x126a00)

Stack trace of thread 346702:
#0 0x00007f008ea14197 n/a (libc.so.6 + 0x91197)
#1 0x00007f008ea16f1b pthread_cond_timedwait (libc.so.6 + 0x93f1b)
#2 0x0000556a19e2b922 __wt_cond_wait_signal (mongod + 0x82c0922)
#3 0x0000556a19c76e64 __tiered_server (mongod + 0x810be64)
#4 0x00007f008ea17b43 n/a (libc.so.6 + 0x94b43)
#5 0x00007f008eaa9a00 n/a (libc.so.6 + 0x126a00)

Stack trace of thread 346699:
#0 0x00007f008ea14197 n/a (libc.so.6 + 0x91197)
#1 0x00007f008ea16ac1 pthread_cond_wait (libc.so.6 + 0x93ac1)
#2 0x0000556a237e136c _ZNSt18condition_variable4waitERSt11unique_lockISt5mutexE (mongod + 0x11c7636c)
#3 0x0000556a1775c21f _ZNSt3_V222condition_variable_any4waitISt11unique_lockIN5mongo12latch_detail5LatchEEEEvRT_ (mongod + 0x5bf121f)
#4 0x0000556a22cfd6ae _ZN5mongo10ThreadPool4Impl13_consumeTasksEv (mongod + 0x111926ae)
#5 0x0000556a22cfc51b _ZN5mongo10ThreadPool4Impl17_workerThreadBodyERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE (mongod + 0x1119151b)
#6 0x0000556a22d00de2 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC1IZNS3_10ThreadPool4Impl25_startWorkerThread_inlockEvE4$_21JELi0EEET_DpOT0_EUlvE_E
|
|
|
Hey amit.gupta@opensense.com. It looks like this failure is due to the process receiving or raising a SIGABRT signal, rather than the segmentation fault (SIGSEGV) errors we saw previously. Unfortunately, this sort of crash (as well as some of the others you have experienced, such as the WiredTiger panics) isn't as obviously indicative of memory corruption or mismanagement, and may not be caused by any ASAN-detectable error. Hopefully, if a future crash is due to a segmentation fault, ASAN will find something and log it.
Once you have uploaded the core dump and other information I will take a look through it to see if there is any new evidence/clues. Thanks, George
|
|
Just got a coredump on one of the nodes. Gathering all the information and will upload it. However, there is no ASan log file. Here is the output of "systemctl status mongod":
root@hildb3:/mongodb/var/logs/mongod # ls -al /usr/bin/mongod
|
-rwxr-xr-x 1 root root 2936743384 May 2 23:24 /usr/bin/mongod
|
root@hildb3:/mongodb/var/logs/mongod # systemctl status mongod
|
× mongod.service - MongoDB Database Server
|
Loaded: loaded (/etc/systemd/system/mongod.service; disabled; vendor preset: enabled)
|
Active: failed (Result: signal) since Fri 2023-05-05 15:24:14 EDT; 6min ago
|
Docs: https://docs.mongodb.org/manual
|
Process: 346695 ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf (code=killed, signal=ABRT)
|
Main PID: 346695 (code=killed, signal=ABRT)
|
CPU: 3h 46min 10.842s
|
May 04 01:07:41 hildb3 systemd[1]: Started MongoDB Database Server.
|
May 05 15:24:14 hildb3 systemd[1]: mongod.service: Main process exited, code=killed, status=6/ABRT
|
May 05 15:24:14 hildb3 systemd[1]: mongod.service: Failed with result 'signal'.
|
May 05 15:24:14 hildb3 systemd[1]: mongod.service: Consumed 3h 46min 10.842s CPU time.
|
The directory is world-writable:
ls -al /mongodb/var/logs/mongod
|
total 0
|
drwxr-xrwx 2 mongodb mongodb 6 May 4 13:12 .
|
drwxr-xrwx 3 mongodb mongodb 20 May 4 13:12 ..
|
|
|
|
chris.kelly@mongodb.com george.wangensteen@mongodb.com I just uploaded to the new link. Thanks!
eric.sedor@mongodb.com quick fyi, the first secondary node completed validate({full: true}) and it came back with no errors or warnings. I will continue with all the nodes through the evening so that we start tomorrow with confirmation that the data directories are all fine.
valid: true,
repaired: false,
warnings: [],
errors: [],
extraIndexEntries: [],
missingIndexEntries: [],
corruptRecords: [],
ok: 1
|
|
Hey amit.gupta@opensense.com - it looks like we didn't get the file, could you please try this link instead?
Thanks for bearing with us!
|
|
Thanks chris.kelly@mongodb.com, george.wangensteen@mongodb.com - I've uploaded sample-data.json, which was extracted using mongoexport.
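(For reference, a hedged sketch of the kind of mongoexport invocation that pulls a single sample document - the exact namespace and flags used aren't recorded here, so adjust as needed:)
mongoexport --db=test_db --collection=activity --limit=1 --out=sample-data.json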
|
|
Hi amit.gupta@opensense.com,
Here's a new secure upload portal link for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.
|
|
No prob george.wangensteen@mongodb.com. I tried uploading a sample to the upload portal, but the link has expired. Could a new one be generated?
|
|
amit.gupta@opensense.com - I will let eric.sedor@mongodb.com answer your question about validate. And I appreciate your willingness to save the datafiles. I also wanted to ask if you would be able to provide a "sample document" that shows the schema and types of data being upserted - I'm continuing to think about what could explain the string-encoded binary I found in the core dump that I mentioned above, and knowing what the user data looks like may help determine this. If you don't want to share it on the ticket, another option would be uploading a text file to the secure upload link you've been using. Thanks again, George
|
|
Thanks eric.sedor@mongodb.com. I'm running the replicaSet maintenance operation on the secondary nodes one by one, and am currently running "validate({full: true})" on the first node. Once they are all complete, I'll put a quick note here on the findings. I'm curious: if validate({full: true}) comes back clean, does that provide solid confirmation that the data files are free from corruption? I'm hoping validate({full: true}) comes back with errors.
I'm happy to also provide the datafiles /var/lib/mongo/ folder. The files total about 160GB and this particular collection shouldn't have anything confidential in it. The next time we have a crash, I will copy the corrupted data files to /var/lib/mongo.corrupted so they will be available for analysis in case you want them.
I'll hold off on changing any log settings until you think we should.
|
|
Hi amit.gupta@opensense.com, this is Eric chiming in again.
"If I run the logical sync over the network, would that effectively accomplish the maintenance requirements? ie, does a logical sync validate all the data coming in?"
A logical sync ensures a new and consistent file structure. If the sync source is free of any previous corruption, this ensures a clean set of data. However, if the sync source has structural issues then it's conceivable that those structural issues prevent the sync from copying all data correctly. But, in either case, the newly synced node is expected to be internally consistent at the time of the sync.
I make a distinction here because, due to what you've reported so far, I'm not sure what we can guarantee with respect to re-using data files over time. And our best shot at diagnosing the issue is to start from known good points and have complete reports of the subsequent failure (including potentially the data files themselves).
"We are also looking at maybe increasing logging verbosity"
In the context of the above, I am inclined to recommend focusing on the diagnostics we've asked for already (mongod logs, diagnostic.data directory, core dump(s) for crashes, stack traces for hung processes) rather than suggesting any change to log verbosity. It's not yet clear which log verbosity components will be most interesting, and we don't want to suggest overlogging yet.
|
|
Thanks George! Will do. The reason I'm starting clean again is we have a script that automatically tries to restart mongod in case of a crash. I just disabled that in case it is producing a different WiredTiger crash on restart. So just trying to start totally clean. If I run the logical sync over the network, would that effectively accomplish the maintenance requirements? ie, does a logical sync validate all the data coming in?
We are also looking at maybe increasing logging verbosity for https://www.mongodb.com/docs/manual/reference/configuration-options/#mongodb-setting-systemLog.component.storage.wt.verbosity
If you think we should do any others, please let me know. (Or if you think we should not do this as well.)
|
|
Hi amit.gupta@opensense.com ,
I'm sorry to hear we don't have log output for the latest crashes. I'm hoping that next time we will. I just wanted to call out: when you create the directory, please also ensure that it is writeable by the mongod process, so that we collect the appropriate log and ASAN output. Verifying this with the verbose ASAN option we discussed above may also help.
I understand that we have limited data this time and will likely need to wait until we have more data. However, I noticed from the stack trace you posted that this crash appears to be due to a WiredTiger panic. As such, I wanted to reiterate the advice for this case from my above comment (copied and pasted from the above):
This error and its associated but partial logging information [edit: can be, since this is a different panic] indicative of data file corruption, and could conceivably be caused by memory corruption. I understand you've re-created nodes at various points during this long-running troubleshooting process and that specific attention to this issue may not be needed. We're covering this topic to:
- Request that if the WiredTiger library panic occurs again, that, prior to re-creating its data via logical initial sync, you:
- Store a copy of the logs produced by mongodb during validation as well as the output of the validate({full: true}) command.
- Store a copy of the dbPath for our possible inspection
(end copy and paste).
I know you mentioned that "Lately [you] have been recovering from local copies of the /var/lib/mongo directory since it is much faster to recover". If it's not too late, I would advise against this in favor of a logical initial sync, for the reasons above. In addition, if possible, I would recommend running the validate command and storing all output as part of the replica-set maintenance process as described above and in the linked document, and uploading the results to this ticket. Saving a copy of the dbPath may be helpful as well, if possible.
We need more data to assess the particular cause of WiredTiger panics, so saving this information along with the core dumps is important when you see this error in the stack trace.
Like you said, we may need to wait for more data due to the missing information, but saving and uploading this information (along with the core dumps) from this incident to Amphora is still very useful to us.
Best,
George
|
|
For some reason the /mongodb/var/logs/mongod directory was not on my servers. Most likely my screw-up. I've created that directory now. I've also updated the /etc/systemd/coredump.conf settings to raise the coredump size limits to 60GB (though the secondary coredumps are much smaller).
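For future reference, the coredump.conf change amounts to something like this (a sketch - the exact keys beyond the 60GB limits are assumptions):
# /etc/systemd/coredump.conf
|
[Coredump]
|
ProcessSizeMax=60G
|
ExternalSizeMax=60G
|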
Starting fresh with everything now.. So ignore the 3 crashes I just reported (unless you think their backtraces are useful). Will report back when it happens again with all the data.
|
|
Just got 3 crashes in short succession of each other. In the process of recovering. It's strange because I don't even seem to have the /mongodb directory that is supposed to house the asan.log; will investigate that. Here is a backtrace from one of the servers in case it's helpful. I'm going to gather the coredumps in the meantime.
One of them is truncated to 20GB. I believe that to increase the limit we need to change /etc/systemd/coredump.conf.
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
|
--Type <RET> for more, q to quit, c to continue without paging--c
|
Core was generated by `/usr/bin/mongod --config /etc/mongod.conf'.
|
Program terminated with signal SIGABRT, Aborted.
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140583295649344) at ./nptl/pthread_kill.c:44
|
44 ./nptl/pthread_kill.c: No such file or directory.
|
[Current thread is 1 (Thread 0x7fdc19660640 (LWP 797120))]
|
(gdb) bt
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140583295649344) at ./nptl/pthread_kill.c:44
|
#1 __pthread_kill_internal (signo=6, threadid=140583295649344) at ./nptl/pthread_kill.c:78
|
#2 __GI___pthread_kill (threadid=140583295649344, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
|
#3 0x00007fdc2c05b476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
|
#4 0x00007fdc2c0417f3 in __GI_abort () at ./stdlib/abort.c:79
|
#5 0x0000557fcab8cbcc in mongo::fassertFailedWithLocation (msgid=<optimized out>,
|
file=0x557fbca38300 <str> "src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp", line=712) at src/mongo/util/assert_util.cpp:200
|
#6 0x0000557fc12ac96b in mongo::fassertWithLocation (msgid=50853, file=0xc29c0 <error: Cannot access memory at address 0xc29c0>, line=712,
|
testOK=<optimized out>) at src/mongo/util/assert_util.h:307
|
#7 mongo::(anonymous namespace)::mdb_handle_error_with_startup_suppression (handler=<optimized out>, session=<optimized out>, errorCode=<optimized out>,
|
message=<optimized out>) at src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp:712
|
#8 0x0000557fc179d0cc in __eventv (session=0x7fdc2394eb78, is_json=false, error=-31804,
|
func=0x557fbca57fe0 <__PRETTY_FUNCTION__.__wt_block_checkpoint_resolve> "int __wt_block_checkpoint_resolve(WT_SESSION_IMPL *, WT_BLOCK *, _Bool)",
|
line=929, category=426097408, level=<optimized out>, fmt=<optimized out>, ap=<optimized out>) at src/third_party/wiredtiger/src/support/err.c:433
|
#9 0x0000557fc179ec80 in __wt_panic_func (session=<optimized out>, error=22,
|
func=0x557fbca57fe0 <__PRETTY_FUNCTION__.__wt_block_checkpoint_resolve> "int __wt_block_checkpoint_resolve(WT_SESSION_IMPL *, WT_BLOCK *, _Bool)",
|
line=929, category=WT_VERB_DEFAULT, fmt=<optimized out>) at src/third_party/wiredtiger/src/support/err.c:534
|
#10 0x0000557fc1336fa9 in __wt_block_checkpoint_resolve (session=0x7fdc2394eb78, block=0x61c000019080, failed=<optimized out>)
|
at src/third_party/wiredtiger/src/block/block_ckpt.c:919
|
#11 0x0000557fc15b76ea in __meta_track_unroll (session=<optimized out>, trk=<optimized out>) at src/third_party/wiredtiger/src/meta/meta_track.c:180
|
#12 __wt_meta_track_off (session=0x7fdc2394eb78, need_sync=<optimized out>, unroll=<optimized out>) at src/third_party/wiredtiger/src/meta/meta_track.c:305
|
#13 0x0000557fc1870a61 in __txn_checkpoint (session=<optimized out>, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1118
|
#14 __txn_checkpoint_wrapper (session=0x7fdc2394eb78, cfg=0x7fdc1965dee0) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1181
|
#15 0x0000557fc186efe9 in __wt_txn_checkpoint (session=0x7fdc2394eb78, cfg=0x7fdc1965dee0, waiting=<optimized out>)
|
at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1244
|
#16 0x0000557fc172bb2d in __session_checkpoint (wt_session=0x7fdc2394eb78, config=0x557fbca25e80 <str> "use_timestamp=true")
|
at src/third_party/wiredtiger/src/session/session_api.c:1944
|
#17 0x0000557fc11904cf in mongo::WiredTigerKVEngine::_checkpoint (this=0x61b000002a80, session=0x7fdc2394eb78)
|
at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:2073
|
#18 0x0000557fc11b0cb9 in mongo::WiredTigerKVEngine::checkpoint (this=0x61b000002a80) at src/mongo/db/storage/wiredtiger/wiredtiger_kv_engine.cpp:2090
|
#19 0x0000557fc3472c9c in mongo::Checkpointer::run (this=0x6120000151c0) at src/mongo/db/storage/checkpointer.cpp:117
|
#20 0x0000557fca49ecb0 in mongo::BackgroundJob::jobBody (this=<optimized out>) at src/mongo/util/background.cpp:160
|
#21 0x0000557fca4a4d12 in mongo::BackgroundJob::go()::$_0::operator()() const (this=<optimized out>) at src/mongo/util/background.cpp:186
|
#22 std::__invoke_impl<void, mongo::BackgroundJob::go()::$_0>(std::__invoke_other, mongo::BackgroundJob::go()::$_0&&) (__f=...)
|
at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/bits/invoke.h:60
|
#23 std::__invoke<mongo::BackgroundJob::go()::$_0>(mongo::BackgroundJob::go()::$_0&&) (__fn=...)
|
at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/bits/invoke.h:95
|
#24 std::__apply_impl<mongo::BackgroundJob::go()::$_0, std::tuple<>>(mongo::BackgroundJob::go()::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long>) (
|
__f=..., __t=...)
|
at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/tuple:1678
|
#25 std::apply<mongo::BackgroundJob::go()::$_0, std::tuple<> >(mongo::BackgroundJob::go()::$_0&&, std::tuple<>&&) (__f=..., __t=...)
|
at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/tuple:1687
|
#26 mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}::operator()() (this=<optimized out>)
|
at src/mongo/stdx/thread.h:193
|
#27 std::__invoke_impl<void, mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}>(std::__invoke_oth--Type <RET> for more, q to quit, c to continue without paging--c
|
er, mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}&&) (__f=...) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/bits/invoke.h:60
|
#28 std::__invoke<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}>(mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}&&) (__fn=...) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/bits/invoke.h:95
|
#29 std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}> >::_M_invoke<0ul>(std::_Index_tuple<0ul>) (this=<optimized out>) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/thread:244
|
#30 std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}> >::operator()() (this=<optimized out>) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/thread:253
|
#31 std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::$_0, , 0>(mongo::BackgroundJob::go()::$_0)::{lambda()#1}> > >::_M_run() (this=<optimized out>) at /opt/mongodbtoolchain/revisions/11316f1e7b36f08dcdd2ad0640af18f9287876f4/stow/gcc-v3.OP4/lib/gcc/x86_64-mongodb-linux/8.5.0/../../../../include/c++/8.5.0/thread:196
|
#32 0x0000557fcaf9a17f in execute_native_thread_routine ()
|
#33 0x00007fdc2c0adb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
|
#34 0x00007fdc2c13fa00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
|
|
|
Awesome, the asan.log.<pid> is generated with verbosity=1. I have deployed the ASan build to all 4 of the secondaries. I still think there is a chance we will not get the asan.log upon a crash, because I think I had this config running in the past on a node that crashed.
I'm planning on letting the ASAN build burn in until Friday. I will stop our gen_reports.js process (which seems to be magically preventing crashes) on Saturday and see if I can capture some crashes.
|
|
Hi amit.gupta@opensense.com,
I've determined that in this environment, we don't see output from ASAN when a signal is sent externally, as you surmised. To ensure ASAN output is being generated correctly, you can append the "verbosity=1" option to your ASAN_OPTIONS environment variable before running the exact test-script/set of instructions you posted above. So in total, your new ASAN_OPTIONS should look like:
ASAN_OPTIONS="detect_leaks=0:use_sigaltstack=0:check_initialization_order=true:strict_init_order=true:abort_on_error=1:disable_coredump=0:handle_abort=1:unmap_shadow_on_exit=1:log_path=/mongodb/var/logs/mongod/asan.log:verbosity=1"
|
You should then see some output at the asan log file path, which will confirm ASAN is set up correctly. Once you have confirmed the setup is working, you can remove the verbosity flag before deploying the ASAN build to your cluster, to reduce the amount of output.
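(For example, assuming the log_path configured above, after restarting mongod you could confirm with:)
ls -l /mongodb/var/logs/mongod/asan.log.*
|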
Additionally, please keep in mind that an ASAN build will have an approximately 2x slowdown over an ordinary production build, so you may want to only deploy the ASAN build on secondary nodes.
I hope this resolves the ASAN issue and allows you to deploy the binary. I will pick up investigating once we have some core dumps from the recent crashes and/or ASAN output. Thank you!
|
|
Thanks amit.gupta@opensense.com,
I'm working on figuring out how we can ensure you have the ASAN build set up correctly.
I would expect the script you produced above to produce an ASAN log file (it should be named "asan.log.<pid>"), but I'm confirming that this morning. While I investigate, I would appreciate it if you could double-check that the "mongodb" permissions group and "mongodb" user are correctly configured and provide that information (i.e. by showing the output of the getent and/or id commands; see the sketch below), or check whether the behavior is different if you simply make the log directory world-writable.
Additionally, it would be helpful if you could inspect stdout and stderr immediately after killing the process, to see if there is any output present there. (Note that systemd may redirect stdout and stderr for your service to syslog, which on Ubuntu is, I believe, at /var/log/syslog; you should also be able to view the output of the mongod service by using "journalctl -u [unitfile]", where unitfile is the systemd .service name of the service running mongodb.)
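For example, a hedged sketch of those checks (assuming both the user and group are named "mongodb"):
id mongodb
|
getent passwd mongodb
|
getent group mongodb
|
ls -ld /mongodb/var/logs/mongod
|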
EDIT: a quick update. I think I've reproduced the issue of ASAN not outputting to the log file with this setup in my environment and am working on determining a fix.
|
|
George, I tried on a test db node to get an ASAN log generated using "kill -11 <mongod_pid>". I'm not sure if this is a valid way to test-generate the asan.log, because it's not clear whether kill -11 would actually corrupt any memory for ASAN to detect. Would you expect the below steps to generate an asan.log?
sudo apt -y install clang lldb lld
|
mkdir -p /mongodb/var/logs/mongod
|
chown -R mongodb:mongodb /mongodb/
|
ln -s /usr/bin/llvm-symbolizer-14 /usr/bin/llvm-symbolizer
|
|
wget https://mciuploads.s3.amazonaws.com/mongodb-mongo-master/ubuntu2204-asan/c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d/dist/mongo-mongodb_mongo_master_ubuntu2204_asan_patch_c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d_641344592fbabe080b19cb03_23_03_16_16_31_49.tgz
|
tar xvzf mongo-mongodb_mongo_master_ubuntu2204_asan_patch_c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d_641344592fbabe080b19cb03_23_03_16_16_31_49.tgz
|
# added following to /etc/systemd/system/mongod.service
|
# LimitCORE=infinity
|
systemctl daemon-reload
|
cp /usr/bin/mongod /usr/bin/mongod.orig # backup original mongod
|
systemctl stop mongod
|
cp mongodb-linux-x86_64-ubuntu2204-6.0.5-patch-641344592fbabe080b19cb03/bin/mongod /usr/bin/mongod
|
# add following to /etc/default/mongod
|
#ASAN_OPTIONS="detect_leaks=0:use_sigaltstack=0:check_initialization_order=true:strict_init_order=true:abort_on_error=1:disable_coredump=0:handle_abort=1:unmap_shadow_on_exit=1:log_path=/mongodb/var/logs/mongod/asan.log"
|
#ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer
|
systemctl start mongod
|
# verify the ASAN options are in /proc/<pid_of_mongod>/environ
|
# kill -11 <pid_of_mongod>
|
#
|
# check /mongodb/var/logs/mongod/ but no file present
|
I'm confident we will be able to send you full coredumps thanks to your instructions. But I'm also focused on getting you the ASAN logs, because you mentioned they will make it a lot easier to debug.
|
|
Sorry, one thing I also want to mention is that our systemd unit file for mongod has the following modification for NUMA handling:
ExecStart=/usr/bin/numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf
|
Just in case you think it may be causing problems.
|
|
Awesome, will add the OS info. FYI, all nodes are on Ubuntu 22.04.2 latest (we removed the node running Debian 11 from the environment since it was also having the issue). I'll include output of "lscpu" for the nodes as well. If you think of any other commands that would be helpful, feel free to let me know and I will add them to my script.
|
|
Thanks amit.gupta@opensense.com, that's very helpful. I just wanted to note that in the future, when you upload coredumps in the structure you mentioned, including the exact MongoDB version and operating system would be great too, just to avoid any miscommunication, as I know your environment has changed somewhat over the course of this ticket. If we associate the core-dump with the incorrect build of the server and/or operating system, it isn't useful to us, so it's just good to be 100% sure here.
I was looking into the mongos router issues that occurred in the past because I suspect the memory issues we saw previously in the mongos router may be related to your current issues - the fact that you have seen segmentation faults without any obvious pattern, across 5.0, 6.0.4, and 6.0.5 makes me wonder if there is an underlying issue whose particular root-cause we haven't uncovered, but exact manifestation of which changes from version-to-version. And I appreciate all of the information you have provided and the call-out about connections. I'll keep my eyes open and hope I can provide more support once we have either ASAN output or usable core-dumps of the current issue on 6.0.5.
|
|
Thanks George, I really appreciate you looking at this. I just wanted to clarify a couple of things:
- we are no longer experiencing the mongos router issues since we upgraded from 6.0.4 to 6.0.5. (This ticket has a lot of stuff near the beginning that is probably not applicable anymore). Do you think it may be easier to start a new ticket so we can start fresh with the current issues? I'm happy to start fresh if you want to.
- Regarding the question of the initial sync methodology. Up until last week, our initial syncs were done via deleting the data directory and then letting mongod resync the data via the network and rebuild the indexes (this would take about 1.5 hours). Lately we have been recovering from local copies of the /var/lib/mongo directory since it is much faster to recover (2-5 minutes instead of 1.5 hours). We are using MongoDB CE 6.0.5
- We will go through the process of replicaset maintenance that you linked to. As background, today we ran "mongod --repair /var/lib/mongo" on a secondary node, and it found no errors. But we will go through the maintenance steps you outlined as well.
- We have no other processes that could be interfering with mongod. The next time a crash happens, I will also send the output of 'ps -ef' on the server. I think the previous coredumps were getting truncated.
- I will put the ASAN build on all nodes of the replicaset of this cluster today as well as update the systemd-coredump settings as you mentioned.
- The WiredTiger coredumps may be a red herring, caused by data corruption from the segmentation fault. ie, after the segfault occurs, sometimes mongod can start back cleanly; sometimes it can't start cleanly and core dumps complaining about WiredTiger files.
Once again, thank you for looking at this. Going forward, I will try to get clearer/better coredumps and all the associated diagnostic data and log files from all the nodes. I'll structure them in a folder structure like this:
/host1-secondary/mongod.log & diagnostic.data
|
/host2-secondary-crashed/mongod.log & diagnostic.data
|
/host3-primary/mongod.log & diagnostic.data
|
etc etc
One thing I wanted to put in the back of your mind is whether this could possibly be due to a bug in connection handling, particularly in situations where inactive connections time out.
|
|
Hey amit.gupta@opensense.com,
I've spent some time reviewing the issues you've been experiencing and the data provided so far in some detail.
After reviewing the timeline of reported issues affecting this cluster, I'd like to focus in two areas, which I think cover most/all of the issues you've faced:
The crash due to "WiredTiger library panic" at 2023-03-31T14:55:09.791-04:00
This error and its associated but partial logging information is almost always indicative of data file corruption, but could conceivably be caused by memory corruption. I understand you've re-created nodes at various points during this long-running troubleshooting process and that specific attention to this issue may not be needed. We're covering this topic to:
- Recommend that you consider the process by which you create replica set nodes. Is this always by logical initial sync, or are you doing something more exotic such as copying dbPaths? I know before you mentioned doing a "full sync" via initial sync in the past, but wanted to triple-check. Can you also confirm you are using community and not enterprise MongoDB binaries? The initial sync methodology between the two differs, so this confirmation is important.
- Suggest that if, after consideration, you have any doubt about the provenance of the data files currently backing any of your nodes, you run validate({full: true}) on each node as part of a replica set maintenance process.
- Request that if the WiredTiger library panic occurs again, that, prior to re-creating its data via logical initial sync, you:
- perform a validate({full: true}) on the crashed node (by starting it in standalone mode according to the replica set maintenance process).
- Store a copy of the logs produced by mongodb during validation as well as the output of the validate({full: true}) command.
- Store a copy of the dbPath for our possible inspection
While I know this hasn't been the predominant issue you've faced, we wanted to provide this information because it is very unusual, and you experienced a similar issue before, reported as a separate occurrence in SERVER-73805.
The second area I'd like to cover is the large amount of segmentation faults/crashes you have faced.
We have been able to examine the core dump from a mongos segmentation fault that you provided on March 10th (the other mongos core-dump you provided is not interpretable/appears damaged). The core dump has some strange memory contents that we cannot currently explain from MongoDB behavior. The segmentation fault in this instance is caused because memory in the MongoDB process space appears to be covered in an ASCII binary string (i.e. ASCII encoded 1s and 0s), lasting for 10,000 bytes and beginning at a page-aligned address. We will continue to investigate this issue to determine if the root cause could be MongoDB behavior, but we would advise that you ensure no other processes are running on the machine that could interfere with MongoDB’s address space (we have seen similar issues in the past caused by the presence of auditware like Guardium). If it’s possible to check the memory hardware used on your systems for any issues, we would recommend that as well.
We have also thoroughly re-examined the logs from each segmentation fault and crash. Unfortunately, we are unable to determine any sort of pattern at this time as to what may be the cause of the invalid memory accesses. The evidence I mentioned above from the core dump leads me to believe that we will need more than just logs and backtraces to find the root-cause of this memory issue; we will need to inspect the program’s memory manually to determine if there is any pattern to the invalid accesses. At this time I have been unable to open any of the other core dumps you provided; it is imperative that when crashes occur we collect full core dumps, along with precise information about the MongoDB build and operating system, that allow us to inspect the memory space further (See generic core dump instructions below for a process that we've used before). Additionally, the ASAN-instrumented build we provided will help provide diagnostics at run-time as to potential memory misuse and issues.
Lastly, I wanted to acknowledge that the crashes on 6.0.5 appear to have different backtraces related to the allocator. However, these segmentation faults are generally evidence of a memory-management problem elsewhere that is only “hit” inside the allocator, so core dumps are important to help us analyze this problem as well.
Ultimately, we want to enumerate what we believe we need for each incident going forward to have the best chance of diagnosing what you're seeing:
- A clear description of what's observed:
- For example, if the issue is a "stuck process": what errors are being thrown by the application? Is systemd reporting the mongod as up? Is the mongod process still logging?
- Mongod logs from all replica set members covering the timeframe of the issue. (We're interested in all nodes' logs because of the proximate crash times of multiple replica set members.)
- The diagnostic.data directory in the dbPath of each replica set member (again, due to the apparent impact on multiple nodes)
- When the incident is a stuck process, collect stack traces prior to reacting to the stuck process by running a simple script for a minute or so, to prompt mongod to output stack trace samples to the process's mongod.log file:
while true; do
    kill -SIGUSR2 <mongod pid>
    sleep 1
done
- When the incident is a crash, a core dump from the node that crashed:
- This requires running with core dumps enabled (see suggested instructions below)
- Ideally you would also be running an ASAN instrumented build, as logs from the ASAN build dramatically improve our ability to trace the use of memory in cases of seemingly patternless crashes. We aren't sure how far we can progress the investigation without this information.
Generic core dump instructions:
- Install the systemd core-dump utility with apt install systemd-coredump
- edit /lib/systemd/system/mongod.service and add LimitCORE=infinity
- run systemctl daemon-reload
- restart mongod (systemctl start mongod)
- Test the generation of a core dump using kill -11 <mongod pid>
- run coredumpctl to see the resulting coredump, identified by the pid at the time
- extract the core dump to a file with coredumpctl --output mongo.core dump <mongod pid>
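To make these steps concrete, here's a hedged end-to-end sketch (the unit name, binary name, and paths are assumptions based on your setup):
sudo apt install systemd-coredump
|
sudo systemctl daemon-reload && sudo systemctl restart mongod
|
MONGOD_PID=$(pgrep -x mongod)
|
sudo kill -11 "$MONGOD_PID"                    # force a test SIGSEGV
|
coredumpctl list mongod                        # the new dump should appear here
|
coredumpctl --output mongo.core dump "$MONGOD_PID"
|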
I hope these instructions will help us get correct core dumps to investigate the future crashes/segmentation faults, which will be essential moving forward here. The existing upload portal should work to provide them to us. Thanks, and we appreciate your help.
|
|
Another potentially interesting fact: for some reason mongocluster3 (the shard that we have all the problems with) does not have any jumbo chunks, while the other shards do. I would have expected a more even distribution of them.
[direct: mongos] config> db.chunks.find({shard: 'mongocluster', 'jumbo': true}).count()
|
348
|
[direct: mongos] config> db.chunks.find({shard: 'mongocluster2', 'jumbo': true}).count()
|
54
|
[direct: mongos] config> db.chunks.find({shard: 'mongocluster3', 'jumbo': true}).count()
|
0
|
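(For what it's worth, the same counts can be pulled in one aggregation - a hedged sketch, run via mongos against the config database:)
mongosh --quiet --eval 'db.getSiblingDB("config").chunks.aggregate([{$match: {jumbo: true}}, {$group: {_id: "$shard", n: {$sum: 1}}}]).forEach(printjson)'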
|
|
Also, fyi, I just ran "mongod --repair" against one of the backup nodes that I took out of the cluster, and no errors were found:
{"t":{"$date":"2023-05-02T16:48:45.331-04:00"},"s":"I", "c":"INDEX", "id":20306, "ctx":"initandlisten","msg":"Validation complete for collection. No corruption found","attr":{"namespace":"test_db.testactivity","uuid":{"uuid":{"$uuid":"f0574254-d3dd-4894-bbe3-faf365e87af2"}}}}
|
....
|
{"t":{"$date":"2023-05-02T16:48:45.332-04:00"},"s":"I", "c":"STORAGE", "id":4934002, "ctx":"initandlisten","msg":"Validate did not make any repairs","attr":{"collection":"test_db.testactivity"}}
|
Another fyi: we just had a crash on one of the secondaries in the 3-node RS. So reducing the RS members seems to have no effect.
Interesting fact #2: when the crash happened, it looks like the report generation process was sleeping for at least 10 minutes (maybe more). We are making updates so the report generation process does not sleep and continues querying.
|
|
The nodes all recovered from backup and became secondaries for a short while, but it looks like they have re-crashed. It may sound crazy, but I am going to try making them 3-node replica sets instead of 5-node RSs.
|
|
Here are some backtraces from coredumpctl debug (coredumpctl seems to indicate the coredump is truncated, so I'm trying to see how to make it not truncate). It looks like both of them are trying to access memory at nearly the same address (one backtrace stops at 0x7fd58243e258, the other at 0x7f4986e46258 - both ending in e258).
on daldb3:
(gdb) bt
|
#0 0x000056396aeae18b in tc_newarray ()
|
#1 0x000056396ad94bc6 in boost::log::v2s_mt_posix::attribute_set::attribute_set() ()
|
#2 0x000056396ac6572a in mongo::logv2::LogSource::LogSource(mongo::logv2::LogDomain::Internal const*, bool) ()
|
#3 0x000056396ac5ff18 in mongo::logv2::LogDomainGlobal::Impl::source() ()
|
#4 0x000056396ac59f0c in mongo::logv2::detail::doLogImpl(int, mongo::logv2::LogSeverity const&, mongo::logv2::LogOptions const&, mongo::StringData, mongo::logv2::TypeErasedAttributeStorage const&) ()
|
#5 0x000056396ac886a0 in mongo::(anonymous namespace)::writeMallocFreeStreamToLog() ()
|
#6 0x000056396ac88a14 in abruptQuitWithAddrSignal ()
|
#7 <signal handler called>
|
#8 0x000056396aeae18b in tc_newarray ()
|
Backtrace stopped: Cannot access memory at address 0x7fd58243e258
|
on ashdb3:
(gdb) bt
|
#0 0x0000559fe633b18b in tc_newarray ()
|
#1 0x0000559fe6221bc6 in boost::log::v2s_mt_posix::attribute_set::attribute_set() ()
|
#2 0x0000559fe60f272a in mongo::logv2::LogSource::LogSource(mongo::logv2::LogDomain::Internal const*, bool) ()
|
#3 0x0000559fe60ecf18 in mongo::logv2::LogDomainGlobal::Impl::source() ()
|
#4 0x0000559fe60e6f0c in mongo::logv2::detail::doLogImpl(int, mongo::logv2::LogSeverity const&, mongo::logv2::LogOptions const&, mongo::StringData, mongo::logv2::TypeErasedAttributeStorage const&) ()
|
#5 0x0000559fe61156a0 in mongo::(anonymous namespace)::writeMallocFreeStreamToLog() ()
|
#6 0x0000559fe6115a14 in abruptQuitWithAddrSignal ()
|
#7 <signal handler called>
|
#8 0x0000559fe633b18b in tc_newarray ()
|
Backtrace stopped: Cannot access memory at address 0x7f4986e46258
|
|
|
We had another 3 nodes in the 5-node replica set all crash within minutes of each other. All their data files got corrupted during the crash, so we needed to restore from backup folders and restart.
|
|
The report generation process is a simple Node.js/mongoose app that queries the db with "secondaryPreferred" pretty aggressively.
|
|
Couple quick updates that may be interesting. We were able to reduce (or almost eliminate) the crashes by starting one of our report generation processes that does heavy queries against our secondary nodes. This process was turned off by accident (and unknown to us) last week, before the crashes started occurring. I only noticed that the report generation process was off because I was looking for clues in its log file to see if it was somehow causing the crashes.
After turning on this process, the crashes almost completely stopped. I say almost completely because it also seems to be related to the latency of the process's connection to the DB shards.
So now our theory is that the crashes actually happen more often when the number of connections is low and/or declining.
|
|
Hi George. Yes, this ticket has a lot of history in it that is probably no longer relevant, since our upgrade to 6.0.5 fixed much of it.
We do not have any more problems with mongos hanging. mongocluster3 is indeed the only shard that is experiencing problems now. We do not have any issues with mongocluster or mongocluster2. It seems like there is something special about the data/chunks/queries in mongocluster3 that makes it unstable.
Our current state is that all of our crashes are happening on members of mongocluster3.
We have about 20 app servers that run mongos which connect to this cluster. The majority of queries come from about 8 of these 20 servers. They are geographically distributed throughout the US, with some in Europe.
We have a handful of queries that have the secondaryPreferred using mongoose. eg:
let retdata = Activity.find(filter, projections).read('secondaryPreferred').lean().cursor();
|
Our queries are actually quite basic IMO. I try to keep things extremely simple.
Here is some chunk metadata to show you the distribution across the shards (it seems like mongocluster3 has more chunks for whatever reason):
chunkMetadata: [
|
{ shard: 'mongocluster', nChunks: 10782 },
|
{ shard: 'mongocluster2', nChunks: 10971 },
|
{ shard: 'mongocluster3', nChunks: 12023 }
|
],
|
|
|
Regarding workload-related questions:
- The majority of the workload is writing to a collection called "activity". This may get millions of writes per day. There are also upserts that may happen on these documents frequently.
- There are queries against secondaries on this collection related to some reporting and trending.
- No code has changed in our environment that would change the query behavior.
- The only thing that changed with our hardware when we were having issues with 5.0 was that we added 2 more nodes to the replica set. Previously we had 3-node replica sets, but we upgraded to 5-node sets because of the outages.
- Many problems we were previously facing were fixed when we upgraded from 6.0.4 to 6.0.5.
- It's unclear why these issues have started popping up more recently (today's new issues only started popping up after running the ASAN build on a couple nodes. I've just finished reverting the ASAN builds for now to eliminate that as the cause for the new crashes today).
- The only thing that may have changed over the last couple of weeks is adding progressively more activity as we add customers.
We upgraded our entire infrastructure from CentOS 7/8 to Ubuntu Jammy starting late last year, completing earlier this year. Along with that upgrade we upgraded our Mongo from 4.2 (I believe) to 5.0 and then fairly quickly to 6.0. We were experiencing a lot more issues with 5.0.
I'm also happy to hop on a screen share to answer any/all questions and show you the environment and run any commands you think would be good to run.
Our original crashes about 5 weeks ago seemed to be focused on 2 servers in the replicaSet: pndb3 and ashdb3. We rebuilt ashdb3 to use Debian 11 just to make sure it had nothing to do with Jammy. We've since rebuilt ashdb3 back to Jammy, since the crashes happened on Debian 11 as well.
|
|
Thanks amit.gupta@opensense.com. It's unfortunate that we don't have the ASAN data - let's make sure we nail down the configuration before you re-deploy the ASAN build to ensure that the output is collected.
Thanks for the details on your workloads! I know you mentioned you'll follow up further with uploads/more information when you have time, but I wanted to squeeze in these questions based on what you said:
When you say "mongocluster3 is the shard we're having problems with" - have all the mongod crashes occurred on that shard? From my understanding of the logs and post history, several of the crashes occurred on different shards. What makes you say that is the one you're having the problem with? Additionally, I would appreciate some insight into what secondary reads this shard is receiving, as well as the layout of mongos routers in your cluster. Thanks again.
|
|
Hi George, thanks for helping with this. Will give you a longer writeup answering your questions, but wanted to start with the ASAN config in this message.
ASAN_OPTIONS="detect_leaks=0:use_sigaltstack=0:check_initialization_order=true:strict_init_order=true:abort_on_error=1:disable_coredump=0:handle_abort=1:unmap_shadow_on_exit=1:log_path=/mongodb/var/logs/mongod/asan.log"
|
ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer
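For what it's worth, one way to make sure a systemd-managed mongod actually inherits these variables is a unit drop-in. A minimal sketch; the unit name and drop-in path are assumptions:
# Sketch: pass the ASAN variables to mongod via a systemd drop-in.
sudo mkdir -p /etc/systemd/system/mongod.service.d
sudo tee /etc/systemd/system/mongod.service.d/asan.conf <<'EOF'
[Service]
Environment="ASAN_OPTIONS=detect_leaks=0:use_sigaltstack=0:check_initialization_order=true:strict_init_order=true:abort_on_error=1:disable_coredump=0:handle_abort=1:unmap_shadow_on_exit=1:log_path=/mongodb/var/logs/mongod/asan.log"
Environment="ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer"
EOF
sudo systemctl daemon-reload
sudo systemctl restart mongod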
|
The /mongodb/var/logs/mongod/ directory is world-writable:
ls -al /mongodb/var/logs/mongod/
|
total 0
|
drwxrwxrwx 2 mongodb mongodb 6 Apr 27 16:05 .
|
drwxrwxrwx 3 mongodb mongodb 20 Apr 26 09:51 ..
|
As you can see, no file got put in there. Before we fixed up ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer, I did see asan.log files get put in there, so I'm pretty sure the ASAN configs are correct. It's a mystery why this last crash didn't put a log there.
However, I'm going to temporarily revert the ASAN deployments today, just to rule them out as the reason for the increased crashes we are experiencing today. We haven't had it this bad before, and the crashes look different.
=== ARCHITECTURE ===
3 shards. Each shard is a 5-node replica set. Each node of the replica set is in a different data center, on a server with dual-NVMe software RAID and XFS. Each server has 128GB of RAM.
Ubuntu 22.04 Jammy. We tried Debian 11 but experienced the same problem.
== SHARDS AND TYPES OF DATA ==
mongocluster: stores application configuration. also stores activity data.
mongocluster2: stores activity data only
mongocluster3: stores activity data only
mongocluster3 is the shard we are having problems with. There are a lot of writes happening to it constantly, and many reads happen from the secondaries. It seems like the secondaries are the ones that are crashing.
It seems like we are getting different stack traces today, though somewhat related. As I was typing this message, I got a crash on one of the nodes with this stack trace:
#0 0x000055557c7d270b in tc_malloc ()
|
#1 0x000055557c6b8b69 in boost::log::v2s_mt_posix::attribute::impl::operator new(unsigned long) ()
|
#2 0x000055557c57e235 in mongo::logv2::detail::doLogImpl(int, mongo::logv2::LogSeverity const&, mongo::logv2::LogOptions const&, mongo::StringData, mongo::logv2::TypeErasedAttributeStorage const&) ()
|
#3 0x000055557c5ac6a0 in mongo::(anonymous namespace)::writeMallocFreeStreamToLog() ()
|
#4 0x000055557c5aca14 in abruptQuitWithAddrSignal ()
|
#5 <signal handler called>
|
#6 0x000055557c7d218b in tc_newarray ()
|
#7 0x000055557c27d078 in asio::asio_handler_allocate(unsigned long, ...) ()
|
#8 0x000055557bd5f7cb in mongo::Future<void> mongo::transport::TransportLayerASIO::ASIOSession::opportunisticRead<asio::basic_stream_socket<asio::generic::stream_protocol>, asio::mutable_buffers_1>(asio::basic_stream_socket<asio::generic::stream_protocol>&, asio::mutable_buffers_1 const&, std::shared_ptr<mongo::Baton> const&) ()
|
#9 0x000055557bd6d55a in mongo::Future<void> mongo::transport::TransportLayerASIO::ASIOSession::read<asio::mutable_buffers_1>(asio::mutable_buffers_1 const&, std::shared_ptr<mongo::Baton> const&) ()
|
#10 0x000055557bd561e7 in mongo::transport::TransportLayerASIO::ASIOSession::sourceMessageImpl(std::shared_ptr<mongo::Baton> const&) ()
|
#11 0x000055557bd566db in mongo::transport::TransportLayerASIO::ASIOSession::asyncSourceMessage(std::shared_ptr<mongo::Baton> const&) ()
|
#12 0x000055557bd78f4e in mongo::AsyncDBClient::_waitForResponse(boost::optional<int>, std::shared_ptr<mongo::Baton> const&) ()
|
#13 0x000055557bd793f2 in mongo::AsyncDBClient::_continueReceiveExhaustResponse(mongo::ClockSource::StopWatch, boost::optional<int>, std::shared_ptr<mongo::Baton> const&) ()
|
#14 0x000055557bd798b7 in mongo::AsyncDBClient::awaitExhaustCommand(std::shared_ptr<mongo::Baton> const&) ()
|
#15 0x000055557bd18f8b in mongo::executor::NetworkInterfaceTL::ExhaustCommandState::continueExhaustRequest(std::shared_ptr<mongo::executor::NetworkInterfaceTL::RequestState>, mongo::StatusWith<mongo::executor::RemoteCommandResponse>) ()
|
#16 0x000055557bd198bf in mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::ExecutorFuture<mongo::executor::RemoteCommandResponse>::getAsync<mongo::executor::NetworkInterfaceTL::ExhaustCommandState::continueExhaustRequest(std::shared_ptr<mongo::executor::NetworkInterfaceTL::RequestState>, mongo::StatusWith<mongo::executor::RemoteCommandResponse>)::{lambda(mongo::StatusWith<mongo::executor::RemoteCommandResponse>)#7}, 0>(mongo::executor::NetworkInterfaceTL::ExhaustCommandState::continueExhaustRequest(std::shared_ptr<mongo::executor::NetworkInterfaceTL::RequestState>, mongo::StatusWith<mongo::executor::RemoteCommandResponse>)::{lambda(mongo::StatusWith<mongo::executor::RemoteCommandResponse>)#7}&&) &&::{lambda(mongo::StatusWith<mongo::executor::RemoteCommandResponse>)#1}::operator()(mongo::StatusWith<mongo::executor::RemoteCommandResponse>)::{lambda(mongo::Status)#1}>(mongo::executor::NetworkInterfaceTL::ExhaustCommandState::continueExhaustRequest(std::shared_ptr<mongo::executor::NetworkInterfaceTL::RequestState>, mongo::StatusWith<mongo::executor::RemoteCommandResponse>)::{lambda(mongo::StatusWith<mongo::executor::RemoteCommandResponse>)#7}&&)::SpecificImpl::call(mongo::Status&&) ()
|
#17 0x000055557bd73e7c in mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::ExecutorStats::wrapTask(mongo::unique_function<void (mongo::Status)>&&)::{lambda(mongo::Status)#1}>(mongo::ExecutorStats::wrapTask(mongo::unique_function<void (mongo::Status)>&&)::{lambda(mongo::Status)#1}&&)::SpecificImpl::call(mongo::Status&&) ()
|
#18 0x000055557bd43899 in asio::detail::executor_op<asio::detail::work_dispatcher<mongo::transport::TransportLayerASIO::ASIOReactor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}>, std::allocator<void>, asio::detail::scheduler_operation>::do_complete(void*, asio::detail::scheduler_operation*, std::error_code const&, unsigned long) ()
|
#19 0x000055557c281c74 in asio::detail::scheduler::do_run_one(asio::detail::conditionally_enabled_mutex::scoped_lock&, asio::detail::scheduler_thread_info&, std::error_code const&) ()
|
#20 0x000055557c281f05 in asio::detail::scheduler::run(std::error_code&) ()
|
#21 0x000055557c289a8e in asio::io_context::run() ()
|
#22 0x000055557bd39870 in mongo::transport::TransportLayerASIO::ASIOReactor::run() ()
|
#23 0x000055557bd171e7 in mongo::executor::NetworkInterfaceTL::_run() ()
|
#24 0x000055557bd1757e in std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::executor::NetworkInterfaceTL::startup()::{lambda()#1}, , 0>(mongo::executor::NetworkInterfaceTL::startup()::{lambda()#1})::{lambda()#1}> > >::_M_run() ()
|
#25 0x000055557c753c5f in execute_native_thread_routine ()
|
#26 0x00007f118622eb43 in start_thread (arg=<optimized out>) at ./nptl/pthread_create.c:442
|
#27 0x00007f11862c0a00 in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
|
|
|
Hi amit.gupta@opensense.com . Thanks again for your continued help here. We are continuing our analysis of the various crashes/incidents and the data you've provided for them. To follow up on the crashes you experienced today, would you be able to provide us with the full mongod.log files, as well as any diagnostic.data and coredumps? This will allow us to further investigate the stack-trace snippets you posted from these crashes, which all seem to show a segmentation fault inside tcmalloc/our allocator, as you mentioned. The upload link you have been using should work.
Additionally, we would really love to get the mongod.log files, associated data, and ASAN output from the special build you are running that experienced an issue (thanks again for running this build!). You mentioned that there is a coredump for this crash; that would be helpful. You also mentioned that you couldn't find the log data. When chris.kelly@mongodb.com provided the instrumented build above, he included some special instructions about setting certain ASAN_OPTIONS as an environment variable before starting the process. This included making sure you set the "log_path=/mongodb/var/logs/mongod/asan.log" in this environment variable, as well as making sure the directory /mongodb/var/logs/ existed and that you had write permissions to it. Can you confirm that these steps were followed when using the instrumented build, and if so, could you let me know what contents are in that directory? Can you double check that the directory has write permissions?
Lastly, it would be really helpful for us to learn a bit more about your environment, workload, and their respective histories:
- Could you please tell me the topology of your deployment (i.e. how many shards, mongos routers, and mongod nodes per shard; what processes might be colocated; and what hardware each process is running on)?
- Could you tell me how long this deployment has been running, what applications have been using it, and when in the history of this deployment you first saw issues (I know you previously reported similar issues when the deployment was running 5.0 in SERVER-73597)? It would be helpful for us to know about this deployment's history, including the hardware environment it has run in and any changes to it.
- Finally, we would appreciate some more information about your workload. Could you provide a short summary, including what sorts of operations (updates, reads, writes) are done most frequently, and any recent changes in the workload? In particular, we're interested in changes that may have occurred near the start of the incident, and we're most curious about any that may have occurred in the past few days, when issues started occurring again on multiple machines after 3 weeks of uptime.
Thank you again for working with us and we hope to have more information for you soon.
|
|
Just a quick note: daldb3 also just crashed, with something related to tc_newarray. I restarted it cleanly. I plan to put the ASAN build here as well.
Here is coredump info. This server is NOT running the ASAN build:
PID: 8465 (mongod)
|
UID: 122 (mongodb)
|
GID: 129 (mongodb)
|
Signal: 11 (SEGV)
|
Timestamp: Fri 2023-04-28 13:39:59 EDT (7min ago)
|
Command Line: /usr/bin/mongod --config /etc/mongod.conf
|
Executable: /usr/bin/mongod
|
Control Group: /system.slice/mongod.service
|
Unit: mongod.service
|
Slice: system.slice
|
Boot ID: 68b6b88ee05e4a44a40c0908547eaaf5
|
Machine ID: 4768c7c8c6d74e5f8567a0fdc1655407
|
Storage: /var/lib/systemd/coredump/core.mongod.122.68b6b88ee05e4a44a40c0908547eaaf5.8465.1682703599000000.zst (truncated)
|
Disk Size: 599.5M
|
Message: Process 8465 (mongod) of user 122 dumped core.
|
Found module /usr/bin/mongod with build-id: 3c976b679319b912cd7f7ac36b539dcd7f937343
|
Stack trace of thread 8484:
|
#0 0x000055fc0e8a418b tc_newarray (/usr/bin/mongod + 0x503418b)
|
Stack trace of thread 8502:
|
#0 0x00007fad2051dfde n/a (n/a + 0x0)
|
|
|
Hi, I just had 2 more DBs crash:
- vindb3, which has the same tc_malloc error as directly above. I restarted mongod and it's running again.
- pndb3, which has the ASAN build. However, I do not see any file in /mongodb/var/logs/mongod.
What should I be giving you from pndb3? There is a coredump in /var/lib/systemd/coredump/core.mongod.122.cee3455bfc22496e9cfca7e623f8ef1c.247579.1682700542000000.zst
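Since systemd-coredump is holding that dump, coredumpctl can extract or inspect it. A minimal sketch; the PID (247579) is taken from the coredump filename above:
# Sketch: inspect/extract the pndb3 dump with coredumpctl.
coredumpctl info 247579
coredumpctl dump -o pndb3-mongod.core 247579
coredumpctl gdb 247579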
|
|
Thanks for the additional information amit.gupta@opensense.com . I'm going to take a look at all of the data you sent and will make sure we get back to you after some analysis. We really appreciate your willingness to use an instrumented build and your helpfulness in providing us with debugging information.
|
|
I am putting the debug build on ashdb3 now. So then ashdb3 and pndb3 will have the debug builds.
|
|
We had 2 servers seg-fault within a minute of each other today: ashdb3 and vindb3. The debug version is running on pndb3. The strange thing is that it looks like a totally different stack trace and issue. I was able to restart the service without deleting data files or resyncing. Here is the stack trace on both machines:
Core was generated by `/usr/bin/mongod --config /etc/mongod.conf'.
|
Program terminated with signal SIGSEGV, Segmentation fault.
|
#0 0x000055a055baf70b in tc_malloc ()
|
[Current thread is 1 (LWP 3654)]
|
(gdb) bt
|
#0 0x000055a055baf70b in tc_malloc ()
|
#1 0x000055a055a95b69 in boost::log::v2s_mt_posix::attribute::impl::operator new(unsigned long) ()
|
#2 0x000055a05595b235 in mongo::logv2::detail::doLogImpl(int, mongo::logv2::LogSeverity const&, mongo::logv2::LogOptions const&, mongo::StringData, mongo::logv2::TypeErasedAttributeStorage const&) ()
|
#3 0x000055a0559896a0 in mongo::(anonymous namespace)::writeMallocFreeStreamToLog() ()
|
#4 0x000055a055989a14 in abruptQuitWithAddrSignal ()
|
#5 <signal handler called>
|
#6 0x000055a055b050eb in tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned int, int) ()
|
Backtrace stopped: Cannot access memory at address 0x7f5f24d4a448
|
|
|
One interesting note: after systemctl stop mongod, it looks like systemd kills mongod. But this time, when we restarted mongod, it loaded up and the data files do not look corrupted.
|
|
I've also attached the gdb "thread apply all backtrace" output to this ticket.
|
|
More gdb thread info, continued:
134 Thread 0x7f7d51abb700 (LWP 368795) "conn140" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
135 Thread 0x7f7d519ba700 (LWP 368796) "conn141" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
136 Thread 0x7f7d518b9700 (LWP 368797) "conn142" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
137 Thread 0x7f7d517b8700 (LWP 368798) "conn143" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
138 Thread 0x7f7d516b7700 (LWP 368799) "conn144" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
139 Thread 0x7f7d515b6700 (LWP 368800) "conn145" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
140 Thread 0x7f7d514b5700 (LWP 368801) "conn146" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
141 Thread 0x7f7d513b4700 (LWP 368802) "conn147" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
142 Thread 0x7f7d512b3700 (LWP 368803) "conn148" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
143 Thread 0x7f7d511b2700 (LWP 368804) "conn149" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
144 Thread 0x7f7d50fb0700 (LWP 368806) "conn151" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
145 Thread 0x7f7d50eaf700 (LWP 368807) "conn152" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
146 Thread 0x7f7d50dae700 (LWP 368808) "conn153" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
147 Thread 0x7f7d50cad700 (LWP 368809) "conn154" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
148 Thread 0x7f7d50bac700 (LWP 368810) "conn155" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
149 Thread 0x7f7d509aa700 (LWP 368812) "conn157" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
150 Thread 0x7f7d508a9700 (LWP 368813) "conn158" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
151 Thread 0x7f7d507a8700 (LWP 368814) "conn159" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
152 Thread 0x7f7d506a7700 (LWP 368815) "conn160" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
153 Thread 0x7f7d505a6700 (LWP 368816) "conn161" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
154 Thread 0x7f7d504a5700 (LWP 368817) "conn162" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
155 Thread 0x7f7d503a4700 (LWP 368818) "conn163" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
156 Thread 0x7f7d502a3700 (LWP 368819) "conn164" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
157 Thread 0x7f7d501a2700 (LWP 368820) "conn165" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
158 Thread 0x7f7d500a1700 (LWP 368821) "conn166" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
159 Thread 0x7f7d4ffa0700 (LWP 368822) "conn167" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
160 Thread 0x7f7d4fe9f700 (LWP 368823) "conn168" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
161 Thread 0x7f7d4fd9e700 (LWP 368824) "conn169" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
162 Thread 0x7f7d4fc9d700 (LWP 368825) "conn170" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
163 Thread 0x7f7d4fb9c700 (LWP 368826) "conn171" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
164 Thread 0x7f7d4fa9b700 (LWP 368827) "conn173" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
165 Thread 0x7f7d4f99a700 (LWP 368828) "conn174" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
166 Thread 0x7f7d4f899700 (LWP 368829) "conn175" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
167 Thread 0x7f7d4f798700 (LWP 368830) "conn176" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
168 Thread 0x7f7d4f697700 (LWP 368831) "conn177" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
169 Thread 0x7f7d4f596700 (LWP 368832) "conn178" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
170 Thread 0x7f7d4f495700 (LWP 368833) "conn179" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
171 Thread 0x7f7d4f394700 (LWP 368834) "conn180" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
172 Thread 0x7f7d4f192700 (LWP 368836) "conn182" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
173 Thread 0x7f7d4f091700 (LWP 368837) "conn183" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
174 Thread 0x7f7d4ef90700 (LWP 368838) "conn184" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
175 Thread 0x7f7d4ee8f700 (LWP 368839) "conn185" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
176 Thread 0x7f7d4ed8e700 (LWP 368840) "conn186" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
177 Thread 0x7f7d4ec8d700 (LWP 368841) "conn187" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
178 Thread 0x7f7d4ea8b700 (LWP 368843) "conn189" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
179 Thread 0x7f7d4e889700 (LWP 368845) "conn191" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
180 Thread 0x7f7d4e788700 (LWP 368846) "conn192" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
181 Thread 0x7f7d6a53c700 (LWP 368965) "Shardin.Fixed-3" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582423d614) at ../sysdeps/nptl/futex-internal.h:186
|
182 Thread 0x7f7d5d8f4700 (LWP 371427) "conn258" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
183 Thread 0x7f7d7c379700 (LWP 382656) "conn522" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
184 Thread 0x7f7d5e702700 (LWP 383065) "conn538" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
185 Thread 0x7f7d65532700 (LWP 383066) "conn539" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
186 Thread 0x7f7d5e2fe700 (LWP 383081) "conn540" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
187 Thread 0x7f7d64f2c700 (LWP 393515) "conn779" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
188 Thread 0x7f7d65330700 (LWP 394623) "conn800" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
189 Thread 0x7f7d5d5f1700 (LWP 398238) "conn876" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
190 Thread 0x7f7d7a55c700 (LWP 410038) "conn1163" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
191 Thread 0x7f7d7a45b700 (LWP 414328) "conn1258" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
192 Thread 0x7f7d7a158700 (LWP 414433) "conn1260" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
193 Thread 0x7f7d50aab700 (LWP 418820) "conn1355" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
194 Thread 0x7f7d4e98a700 (LWP 418821) "conn1356" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
195 Thread 0x7f7d5f00b700 (LWP 418822) "conn1357" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
196 Thread 0x7f7d5ea05700 (LWP 419500) "conn1377" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
197 Thread 0x7f7d65431700 (LWP 419539) "conn1378" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
198 Thread 0x7f7d7c177700 (LWP 420331) "conn1395" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
199 Thread 0x7f7d7a959700 (LWP 423053) "conn1465" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
200 Thread 0x7f7d7bf75700 (LWP 424649) "conn1497" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
201 Thread 0x7f7d6522f700 (LWP 426109) "conn1535" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
202 Thread 0x7f7d7a057700 (LWP 426785) "conn1549" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
203 Thread 0x7f7d77253700 (LWP 427128) "conn1560" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
204 Thread 0x7f7d7c47a700 (LWP 428018) "conn1580" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
205 Thread 0x7f7d60427700 (LWP 428019) "conn1581" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
206 Thread 0x7f7d6c540700 (LWP 428939) "ReplWri.ker-650" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
207 Thread 0x7f7d562d2700 (LWP 428957) "ReplWri.ker-656" __pthread_clockjoin_ex (threadid=140176268261120, thread_return=0x0, clockid=<optimized out>, abstime=<optimized out>, block=<optimized out>) at pthread_join_common.c:145
|
208 Thread 0x7f7d7f581700 (LWP 429045) "ReplWri.ker-657" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
209 Thread 0x7f7d7ad5d700 (LWP 429555) "conn1608" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
210 Thread 0x7f7d54acf700 (LWP 429652) "ReplWri.ker-658" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
211 Thread 0x7f7d69d3b700 (LWP 429653) "ReplWri.ker-659" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
212 Thread 0x7f7d582d6700 (LWP 429655) "ReplWri.ker-661" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
213 Thread 0x7f7d572d4700 (LWP 429667) "ReplWri.ker-664" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
214 Thread 0x7f7d55ad1700 (LWP 429668) "ReplWri.ker-665" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
215 Thread 0x7f7d5a2da700 (LWP 429853) "ReplCoord-101" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558241e8490) at ../sysdeps/nptl/futex-internal.h:186
|
216 Thread 0x7f7d5aadb700 (LWP 430187) "ReplWri.ker-666" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
217 Thread 0x7f7d56ad3700 (LWP 430188) "ReplWri.ker-667" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
218 Thread 0x7f7d592d8700 (LWP 430189) "ReplWri.ker-668" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
219 Thread 0x7f7d7a858700 (LWP 430204) "conn1617" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
220 Thread 0x7f7d77354700 (LWP 430387) "conn1618" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
221 Thread 0x7f7d57ad5700 (LWP 430400) "ReplWri.ker-672" __lll_lock_wait (futex=futex@entry=0x5558295d8430, private=0) at lowlevellock.c:52
|
222 Thread 0x7f7d77556700 (LWP 430499) "conn1619" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
223 Thread 0x7f7d7c076700 (LWP 430708) "conn1628" __lll_lock_wait (futex=futex@entry=0x5558226ba940 <mongo::(anonymous namespace)::MallocFreeOStreamGuard::_streamMutex>, private=0) at lowlevellock.c:52
|
224 Thread 0x7f7d7c278700 (LWP 430821) "conn1629" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
225 Thread 0x7f7d5b2dc700 (LWP 431132) "ShardRe.try-810" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
226 Thread 0x7f7d542ce700 (LWP 431151) "CatalogCache-83" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
227 Thread 0x7f7d58ad7700 (LWP 431164) "ShardRe.try-811" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
|
|
gdb thread info:
(gdb) info threads
|
Id Target Id Frame
|
* 1 Thread 0x7f7d8b59acc0 (LWP 368598) "mongod" futex_wait_cancelable (private=0, expected=0,
|
futex_word=0x5558226ba828 <mongo::(anonymous namespace)::shutdownTasksComplete+40>) at ../sysdeps/nptl/futex-internal.h:186
|
2 Thread 0x7f7d8b599700 (LWP 368601) "SignalHandler" 0x00007f7d8c015ba2 in __GI___sigtimedwait (set=0x7f7d8b598170, info=0x7f7d8b5981f8, timeout=0x0)
|
at ../sysdeps/unix/sysv/linux/sigtimedwait.c:29
|
3 Thread 0x7f7d8ad98700 (LWP 368602) "Backgro.kSource" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558241ecb7c)
|
at ../sysdeps/nptl/futex-internal.h:186
|
4 Thread 0x7f7d8a597700 (LWP 368603) "OCSPMan.rHTTP-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558241e8f10)
|
at ../sysdeps/nptl/futex-internal.h:186
|
5 Thread 0x7f7d89d96700 (LWP 368604) "Certifi.Monitor" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d89d957e0, clockid=-1982245168,
|
expected=0, futex_word=0x55582417b780) at ../sysdeps/nptl/futex-internal.h:323
|
6 Thread 0x7f7d89595700 (LWP 368605) "FlowCon.fresher" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d895947e0, clockid=-1990637872,
|
expected=0, futex_word=0x55582417bc00) at ../sysdeps/nptl/futex-internal.h:323
|
7 Thread 0x7f7d88d94700 (LWP 368606) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d88d93840, clockid=-1999030368,
|
expected=0, futex_word=0x555824200bdc) at ../sysdeps/nptl/futex-internal.h:323
|
8 Thread 0x7f7d88593700 (LWP 368611) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d88592880, clockid=-2007423008,
|
expected=0, futex_word=0x5558242012d8) at ../sysdeps/nptl/futex-internal.h:323
|
9 Thread 0x7f7d87d92700 (LWP 368612) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d87d91870, clockid=-2015815728,
|
expected=0, futex_word=0x5558242013d8) at ../sysdeps/nptl/futex-internal.h:323
|
10 Thread 0x7f7d87591700 (LWP 368613) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d87590860, clockid=-2024208448,
|
expected=0, futex_word=0x5558242014dc) at ../sysdeps/nptl/futex-internal.h:323
|
11 Thread 0x7f7d86d90700 (LWP 368614) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d86d8f750, clockid=-2032601424,
|
expected=0, futex_word=0x5558242006dc) at ../sysdeps/nptl/futex-internal.h:323
|
12 Thread 0x7f7d8658f700 (LWP 368615) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d8658e750, clockid=-2040994128,
|
expected=0, futex_word=0x5558242006dc) at ../sysdeps/nptl/futex-internal.h:323
|
13 Thread 0x7f7d85d8e700 (LWP 368616) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d85d8d750, clockid=-2049386832,
|
expected=0, futex_word=0x5558242006dc) at ../sysdeps/nptl/futex-internal.h:323
|
14 Thread 0x7f7d8558d700 (LWP 368617) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d8558c750, clockid=-2057779536,
|
expected=0, futex_word=0x5558242006dc) at ../sysdeps/nptl/futex-internal.h:323
|
15 Thread 0x7f7d84d8c700 (LWP 368618) "mongod" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d84d8b860, clockid=-2066171968,
|
expected=0, futex_word=0x555824201958) at ../sysdeps/nptl/futex-internal.h:323
|
16 Thread 0x7f7d8458b700 (LWP 368619) "WTIdleS.Sweeper" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
17 Thread 0x7f7d83d8a700 (LWP 368620) "OplogVi.yThread" futex_wait_cancelable (private=0, expected=0, futex_word=0x555824256038)
|
at ../sysdeps/nptl/futex-internal.h:186
|
18 Thread 0x7f7d83589700 (LWP 368621) "JournalFlusher" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
19 Thread 0x7f7d82d88700 (LWP 368622) "Checkpointer" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
20 Thread 0x7f7d82587700 (LWP 368623) "Collect.xecutor" futex_wait_cancelable (private=0, expected=0, futex_word=0x555828f8f298)
|
at ../sysdeps/nptl/futex-internal.h:186
|
21 Thread 0x7f7d81d86700 (LWP 368624) "ftdc" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
22 Thread 0x7f7d81585700 (LWP 368625) "DeadlineMonitor" futex_abstimed_wait_cancelable (private=0, abstime=0x7f7d81584830, clockid=-2124921040,
|
expected=0, futex_word=0x555828fb67e0) at ../sysdeps/nptl/futex-internal.h:323
|
23 Thread 0x7f7d7fd82700 (LWP 368629) "ShardRegistry" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
24 Thread 0x7f7d7ed80700 (LWP 368631) "TaskExe.rPool-0" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
25 Thread 0x7f7d7e57f700 (LWP 368632) "Replica.xecutor" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
26 Thread 0x7f7d7dd7e700 (LWP 368633) "ShardRe.Updater" 0x00007f7d8c0d8d56 in epoll_wait (epfd=56, events=0x7f7d7dd7cf70, maxevents=128, timeout=-1)
|
at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
|
27 Thread 0x7f7d7d57d700 (LWP 368634) "Replica.xecutor" __lll_lock_wait (
|
futex=futex@entry=0x5558226ba940 <mongo::(anonymous namespace)::MallocFreeOStreamGuard::_streamMutex>, private=0) at lowlevellock.c:52
|
28 Thread 0x7f7d7cd7c700 (LWP 368635) "monitor.or-HMAC" 0x00007f7d8c0ccd2f in __GI___poll (fds=0x55582427d640, nfds=1, timeout=1727999899)
|
at ../sysdeps/unix/sysv/linux/poll.c:29
|
29 Thread 0x7f7d7bd7a700 (LWP 368637) "replSet.kPinger" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
30 Thread 0x7f7d7b55e700 (LWP 368638) "ShardRegistry" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829161698) at ../sysdeps/nptl/futex-internal.h:186
|
31 Thread 0x7f7d79d5b700 (LWP 368641) "FreeMonNet" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829162f98) at ../sysdeps/nptl/futex-internal.h:186
|
32 Thread 0x7f7d7955a700 (LWP 368642) "FreeMonHTTP-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582423dd10) at ../sysdeps/nptl/futex-internal.h:186
|
33 Thread 0x7f7d78d59700 (LWP 368643) "FreeMon.ocessor" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
34 Thread 0x7f7d78558700 (LWP 368644) "OplogCa.plog.rs" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582423b6a8) at ../sysdeps/nptl/futex-internal.h:186
|
35 Thread 0x7f7d77d57700 (LWP 368645) "ReplNetwork" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
36 Thread 0x7f7d76554700 (LWP 368648) "Reshard.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829250998) at ../sysdeps/nptl/futex-internal.h:186
|
37 Thread 0x7f7d75d53700 (LWP 368649) "Reshard.rvice-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a4490) at ../sysdeps/nptl/futex-internal.h:186
|
38 Thread 0x7f7d75552700 (LWP 368650) "TenantM.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829251098) at ../sysdeps/nptl/futex-internal.h:186
|
39 Thread 0x7f7d74d51700 (LWP 368651) "TenantM.rvice-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a4810) at ../sysdeps/nptl/futex-internal.h:186
|
40 Thread 0x7f7d74550700 (LWP 368652) "Reshard.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558291a0198) at ../sysdeps/nptl/futex-internal.h:186
|
41 Thread 0x7f7d73d4f700 (LWP 368653) "Reshard.rvice-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a4b90) at ../sysdeps/nptl/futex-internal.h:186
|
42 Thread 0x7f7d7354e700 (LWP 368654) "Shardin.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582919ff98) at ../sysdeps/nptl/futex-internal.h:186
|
43 Thread 0x7f7d72d4d700 (LWP 368655) "Shardin.nator-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a4f10) at ../sysdeps/nptl/futex-internal.h:186
|
44 Thread 0x7f7d7254c700 (LWP 368656) "RenameC.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829244d98) at ../sysdeps/nptl/futex-internal.h:186
|
45 Thread 0x7f7d71d4b700 (LWP 368657) "RenameC.rvice-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a5290) at ../sysdeps/nptl/futex-internal.h:186
|
46 Thread 0x7f7d7154a700 (LWP 368658) "TenantM.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829243398) at ../sysdeps/nptl/futex-internal.h:186
|
47 Thread 0x7f7d70d49700 (LWP 368659) "TenantM.rvice-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a5610) at ../sysdeps/nptl/futex-internal.h:186
|
48 Thread 0x7f7d70548700 (LWP 368660) "TenantF.xecutor" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829243498) at ../sysdeps/nptl/futex-internal.h:186
|
49 Thread 0x7f7d6fd47700 (LWP 368661) "TTLMonitor" __lll_lock_wait (futex=futex@entry=0x5558226ba940 <mongo::(anonymous namespace)::MallocFreeOStreamGuard::_streamMutex>, private=0) at lowlevellock.c:52
|
50 Thread 0x7f7d6f546700 (LWP 368662) "Timesta.Monitor" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
51 Thread 0x7f7d6ed45700 (LWP 368663) "FLECrudNetwork" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582917e498) at ../sysdeps/nptl/futex-internal.h:186
|
52 Thread 0x7f7d6e544700 (LWP 368664) "FLECrud-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a5990) at ../sysdeps/nptl/futex-internal.h:186
|
53 Thread 0x7f7d6dd43700 (LWP 368665) "clientcursormon" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
54 Thread 0x7f7d6d542700 (LWP 368666) "Periodi.kRunner" __lll_lock_wait (futex=futex@entry=0x555823fb1bc8, private=0) at lowlevellock.c:52
|
55 Thread 0x7f7d6cd41700 (LWP 368667) "SessionKiller" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558293ddb30) at ../sysdeps/nptl/futex-internal.h:186
|
56 Thread 0x7f7d6bd3f700 (LWP 368669) "abortEx.actions" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
57 Thread 0x7f7d6ad3d700 (LWP 368671) "ChangeS.Remover" __lll_lock_wait (futex=futex@entry=0x555824159008, private=0) at lowlevellock.c:52
|
58 Thread 0x7f7d6953a700 (LWP 368674) "Logical.Refresh" 0x00007f7d8c0ccd2f in __GI___poll (fds=0x55582af57d28, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
|
59 Thread 0x7f7d68d39700 (LWP 368675) "ReplNetwork" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582952fc98) at ../sysdeps/nptl/futex-internal.h:186
|
60 Thread 0x7f7d68538700 (LWP 368676) "Logical.cheReap" 0x00007f7d8c0ccd2f in __GI___poll (fds=0x55582427df50, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
|
61 Thread 0x7f7d67d37700 (LWP 368677) "OplogAp.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582952ec98) at ../sysdeps/nptl/futex-internal.h:186
|
62 Thread 0x7f7d67536700 (LWP 368678) "OplogApplier-0" __lll_lock_wait (futex=futex@entry=0x555824159008, private=0) at lowlevellock.c:52
|
63 Thread 0x7f7d66d35700 (LWP 368679) "ReplCoo.Network" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
64 Thread 0x7f7d66534700 (LWP 368680) "MirrorMaestro" futex_wait_cancelable (private=0, expected=0, futex_word=0x555829251a98) at ../sysdeps/nptl/futex-internal.h:186
|
65 Thread 0x7f7d65d33700 (LWP 368681) "ReplCoo.xtern-0" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
66 Thread 0x7f7d64d31700 (LWP 368683) "MirrorMaestro-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558292a5d10) at ../sysdeps/nptl/futex-internal.h:186
|
67 Thread 0x7f7d64530700 (LWP 368684) "Topolog.bserver" 0x00007f7d8c0ccd2f in __GI___poll (fds=0x55582427e0a8, nfds=1, timeout=-1) at ../sysdeps/unix/sysv/linux/poll.c:29
|
68 Thread 0x7f7d63d2f700 (LWP 368685) "BackgroundSync" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582a621be0) at ../sysdeps/nptl/futex-internal.h:186
|
69 Thread 0x7f7d6352e700 (LWP 368686) "Service.Fixed-0" 0x00007f7d8c0d8d56 in epoll_wait (epfd=139, events=0x7f7d6352c500, maxevents=128, timeout=-1) at ../sysdeps/unix/sysv/linux/epoll_wait.c:30
|
70 Thread 0x7f7d6252c700 (LWP 368688) "listener" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
71 Thread 0x7f7d61d2b700 (LWP 368689) "SyncSou.eedback" futex_wait_cancelable (private=0, expected=0, futex_word=0x7f7d61d2a798) at ../sysdeps/nptl/futex-internal.h:186
|
72 Thread 0x7f7d6152a700 (LWP 368690) "ReplBatcher" __lll_lock_wait (futex=futex@entry=0x555823fb1bc8, private=0) at lowlevellock.c:52
|
73 Thread 0x7f7d60d29700 (LWP 368691) "ApplyBa.Journal" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558295b9200) at ../sysdeps/nptl/futex-internal.h:186
|
74 Thread 0x7f7d60528700 (LWP 368692) "conn45" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
75 Thread 0x7f7d60326700 (LWP 368694) "conn49" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
76 Thread 0x7f7d60225700 (LWP 368695) "conn52" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
77 Thread 0x7f7d60124700 (LWP 368696) "conn54" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
78 Thread 0x7f7d60023700 (LWP 368697) "conn55" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
79 Thread 0x7f7d5ff22700 (LWP 368698) "conn57" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
80 Thread 0x7f7d5fd20700 (LWP 368700) "conn59" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
81 Thread 0x7f7d5fc1f700 (LWP 368701) "conn60" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
82 Thread 0x7f7d5fa1d700 (LWP 368703) "conn62" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
83 Thread 0x7f7d5f91c700 (LWP 368704) "conn63" __libc_recvmsg (flags=0, msg=0x7f7d5f91acd0, fd=171) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
|
84 Thread 0x7f7d5f81b700 (LWP 368705) "conn64" __libc_recvmsg (flags=0, msg=0x7f7d5f819cd0, fd=173) at ../sysdeps/unix/sysv/linux/recvmsg.c:28
|
85 Thread 0x7f7d5f71a700 (LWP 368706) "conn65" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
86 Thread 0x7f7d5f619700 (LWP 368707) "conn66" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
87 Thread 0x7f7d5f510700 (LWP 368708) "conn67" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
88 Thread 0x7f7d5f40f700 (LWP 368709) "conn68" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
89 Thread 0x7f7d5f30e700 (LWP 368710) "conn69" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
90 Thread 0x7f7d5f20d700 (LWP 368711) "conn70" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
91 Thread 0x7f7d5f10c700 (LWP 368712) "conn71" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
92 Thread 0x7f7d5ee09700 (LWP 368715) "conn74" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
93 Thread 0x7f7d5ed08700 (LWP 368716) "conn75" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
94 Thread 0x7f7d5ec07700 (LWP 368717) "conn76" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
95 Thread 0x7f7d5eb06700 (LWP 368718) "conn77" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
96 Thread 0x7f7d5e500700 (LWP 368724) "conn83" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
97 Thread 0x7f7d5e3ff700 (LWP 368725) "conn84" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
98 Thread 0x7f7d5e1fd700 (LWP 368727) "conn86" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
99 Thread 0x7f7d5e0fc700 (LWP 368728) "conn87" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
100 Thread 0x7f7d5dffb700 (LWP 368729) "conn88" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
101 Thread 0x7f7d5defa700 (LWP 368730) "conn89" __lll_lock_wait (futex=futex@entry=0x5558226ba940 <mongo::(anonymous namespace)::MallocFreeOStreamGuard::_streamMutex>, private=0) at lowlevellock.c:52
|
102 Thread 0x7f7d5dbf7700 (LWP 368733) "conn92" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
103 Thread 0x7f7d5d9f5700 (LWP 368735) "conn94" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
104 Thread 0x7f7d5d6f2700 (LWP 368738) "conn97" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
105 Thread 0x7f7d5d4f0700 (LWP 368740) "conn99" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
106 Thread 0x7f7d5d3ef700 (LWP 368741) "conn100" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
107 Thread 0x7f7d5d0ec700 (LWP 368744) "conn103" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
108 Thread 0x7f7d5cfeb700 (LWP 368745) "conn104" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
109 Thread 0x7f7d5cce8700 (LWP 368748) "conn107" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
110 Thread 0x7f7d5cae6700 (LWP 368750) "conn109" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
111 Thread 0x7f7d5c9e5700 (LWP 368751) "conn110" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
112 Thread 0x7f7d5c7e3700 (LWP 368753) "conn112" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
113 Thread 0x7f7d5c5e1700 (LWP 368755) "ReplCoo.Network" futex_wait_cancelable (private=0, expected=0, futex_word=0x55582a63919c) at ../sysdeps/nptl/futex-internal.h:186
|
114 Thread 0x7f7d5bde0700 (LWP 368758) "ReplCoo.xtern-1" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
115 Thread 0x7f7d5b5df700 (LWP 368759) "conn116" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
116 Thread 0x7f7d5b3dd700 (LWP 368761) "conn118" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
|
117 Thread 0x7f7d532cc700 (LWP 368778) "TaskExe.rPool-0" futex_wait_cancelable (private=0, expected=0, futex_word=0x5558313d159c) at ../sysdeps/nptl/futex-internal.h:186
|
118 Thread 0x7f7d52acb700 (LWP 368779) "conn120" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
119 Thread 0x7f7d529ca700 (LWP 368780) "conn121" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
120 Thread 0x7f7d528c9700 (LWP 368781) "conn122" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
121 Thread 0x7f7d527c8700 (LWP 368782) "conn127" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
122 Thread 0x7f7d526c7700 (LWP 368783) "conn128" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
123 Thread 0x7f7d525c6700 (LWP 368784) "conn129" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
124 Thread 0x7f7d524c5700 (LWP 368785) "conn130" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
125 Thread 0x7f7d523c4700 (LWP 368786) "conn131" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
126 Thread 0x7f7d522c3700 (LWP 368787) "conn132" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
127 Thread 0x7f7d521c2700 (LWP 368788) "conn133" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
128 Thread 0x7f7d520c1700 (LWP 368789) "conn134" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
129 Thread 0x7f7d51fc0700 (LWP 368790) "conn135" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
130 Thread 0x7f7d51ebf700 (LWP 368791) "conn136" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
131 Thread 0x7f7d51dbe700 (LWP 368792) "conn137" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
132 Thread 0x7f7d51cbd700 (LWP 368793) "conn138" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
133 Thread 0x7f7d51bbc700 (LWP 368794) "conn139" __lll_lock_wait (futex=futex@entry=0x555824013e28, private=0) at lowlevellock.c:52
|
--Type <RET> for more, q to quit, c to continue without paging--
|
|
|
|
|
On one of the servers in the 5-node replica set (ashdb3), we are running Debian 11. This server is exhibiting similar behavior: the process looks stuck; systemctl status mongod says it's running, but trying to connect via mongosh hangs.
pndb3 is running the ASAN build; however, it's running fine. pndb3 typically has issues around the same time as ashdb3, but pndb3 is fine right now. I'm going to kill and restart ashdb3 now. However, if there are any commands you would like me to run on the stuck process to gather diagnostics next time, please let me know.
Here is some output from a few commands:
strace: Process 368598 attached
|
futex(0x5558226ba828, FUTEX_WAIT_PRIVATE, 0, NULL
|
top shows
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
|
368598 mongodb 20 0 11.1g 8.0g 64284 S 3.7 6.4 40:11.10 mongod
|
Here is the gdb -p <pid> backtrace:
For help, type "help".
|
Type "apropos word" to search for commands related to "word".
|
Attaching to process 368598
|
[New LWP 368601] (repeated about 226 times, once for each thread)
|
|
[Thread debugging using libthread_db enabled]
|
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
|
--Type <RET> for more, q to quit, c to continue without paging--
|
futex_wait_cancelable (private=0, expected=0, futex_word=0x5558226ba828 <mongo::(anonymous namespace)::shutdownTasksComplete+40>)
|
at ../sysdeps/nptl/futex-internal.h:186
|
186 ../sysdeps/nptl/futex-internal.h: No such file or directory.
|
(gdb) bt
|
#0 futex_wait_cancelable (private=0, expected=0, futex_word=0x5558226ba828 <mongo::(anonymous namespace)::shutdownTasksComplete+40>)
|
at ../sysdeps/nptl/futex-internal.h:186
|
#1 __pthread_cond_wait_common (abstime=0x0, clockid=0, mutex=0x555823f2cc10, cond=0x5558226ba800 <mongo::(anonymous namespace)::shutdownTasksComplete>)
|
at pthread_cond_wait.c:508
|
#2 __pthread_cond_wait (cond=0x5558226ba800 <mongo::(anonymous namespace)::shutdownTasksComplete>, mutex=0x555823f2cc10) at pthread_cond_wait.c:638
|
#3 0x000055582111685c in std::condition_variable::wait(std::unique_lock<std::mutex>&) ()
|
#4 0x0000555820f6ee37 in mongo::waitForShutdown() ()
|
#5 0x000055581e28b244 in mongo::(anonymous namespace)::_initAndListen(mongo::ServiceContext*, int) [clone .isra.2019] ()
|
#6 0x000055581e28d521 in mongo::mongod_main(int, char**) ()
|
#7 0x000055581e07c2fe in main ()
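In case it helps for the next occurrence, a possible capture recipe for a hung mongod before restarting it might look like the following. A minimal sketch, assuming gdb/gcore are installed on the host:
# Sketch: snapshot a hung mongod without killing it.
pid=$(pidof mongod)
gcore -o /tmp/mongod-hung "$pid"
gdb -p "$pid" -batch -ex 'thread apply all bt' > /tmp/mongod-hung-bt.txt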
|
|
|
Hi Eric,
These are bare metal servers. No containers involved. There are no other pieces of software running on the boxes that would interfere; they are used purely for mongod. There are Zabbix monitoring agents that periodically run mongosh to collect primary/secondary status and some other stats.
We are not able to downgrade to MongoDB 5.0. We actually upgraded because we were facing a lot of different issues with 5.0. It's been a couple of months, but from memory, some of the issues we previously faced in 5.0 seem similar to the current issue in 6.0.5, though I can't say for sure. Upgrading from 6.0.4 to 6.0.5 did resolve a bunch of stability issues we were having with mongos, though. It feels like we are really close to stability here, so hopefully catching the crash with the ASAN build will help identify the issue quickly. Thank you for all your help!
|
|
Hi amit.gupta@opensense.com, here are some key questions we'd like to understand about your environment, to determine whether other applications could be interfering with MongoDB's operation:
- Can you elaborate in detail on your hosting environment? We're curious whether we should consider containers and, if so, what kind.
- Can you provide detailed information about the other processes running on the same machines? We'd appreciate attention to any programs that may attempt to access MongoDB's files or alter/interact with the MongoDB binary; this would include auditware and/or virus scanners. Dynatrace is a commonly used application that can affect MongoDB's behavior.
Finally, it would be helpful to know whether MongoDB 5.0.x exhibits the same problems you've observed so far.
|
|
Thanks Jason, it looks like it works now!
Just a quick note that htop shows about 20 terabytes of virtual memory in use by mongod. Only about 1GB of resident memory is in use, though, so not that big of a deal.
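(Very large virtual sizes are expected under ASAN, which reserves shadow address space up front. A quick sketch for comparing virtual vs. resident usage:)
# Sketch: compare virtual vs. resident memory of the running mongod.
grep -E 'VmSize|VmRSS' /proc/"$(pidof mongod)"/status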
|
|
amit.gupta@opensense.com, could you clarify what you mean by getting a little further? I expect that the error associated with the stack trace above came with an asan.log. Would you be able to give us the output? I suspect it might still be complaining about the llvm-symbolizer error that we saw previously.
I think the instructions presented earlier, when we first provided the ASAN build, may have been slightly incorrect. We should have run
export ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer
instead of
export ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer<YOUR_VERSION>
I was able to reproduce an issue with your same stack trace, and correcting the above fixed the problem for me after creating the symlink from the comment here.
I think if this still doesn't work, we can coordinate to see if a live screenshare would be more helpful.
|
|
Also, if it's easier, I'm happy to jump on a screen share to do some real-time debugging/troubleshooting to get the ASAN build running.
|
|
Hi, I seem to get a little further now, but I still get a coredump with the following backtrace:
Reading symbols from /usr/bin/mongod...
|
[New LWP 41585]
|
[Thread debugging using libthread_db enabled]
|
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
|
Core was generated by `/usr/bin/mongod --config /etc/mongod.conf'.
|
Program terminated with signal SIGABRT, Aborted.
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140512531737728) at ./nptl/pthread_kill.c:44
|
44 ./nptl/pthread_kill.c: No such file or directory.
|
(gdb) bt
|
#0 __pthread_kill_implementation (no_tid=0, signo=6, threadid=140512531737728) at ./nptl/pthread_kill.c:44
|
#1 __pthread_kill_internal (signo=6, threadid=140512531737728) at ./nptl/pthread_kill.c:78
|
#2 __GI___pthread_kill (threadid=140512531737728, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
|
#3 0x00007fcba02f7476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
|
#4 0x00007fcba02dd7f3 in __GI_abort () at ./stdlib/abort.c:79
|
#5 0x0000556bf91c7657 in __sanitizer::Abort() () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_posix_libcdep.cc:157
|
#6 0x0000556bf91c60a1 in __sanitizer::Die() () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_termination.cc:59
|
#7 0x0000556bf91cdd39 in ChooseExternalSymbolizer () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc:473
|
#8 ChooseSymbolizerTools () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc:516
|
#9 PlatformInit () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc:529
|
#10 0x0000556bf91cc1e4 in GetOrInit () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_libcdep.cc:24
|
#11 0x0000556bf91cdd46 in LateInitialize () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/sanitizer_common/sanitizer_symbolizer_posix_libcdep.cc:534
|
#12 0x0000556bf91b1bcd in AsanInitInternal () at /data/mci/8c40c004d1e8eb5b84c7cd27b6c53640/toolchain-builder/tmp/build-llvm.sh-Upi/llvm/projects/compiler-rt/lib/asan/asan_rtl.cc:497
|
#13 0x00007fcba0bc45be in _dl_init (main_map=0x7fcba0bf92e0, argc=3, argv=0x7ffdb5161d48, env=0x7ffdb5161d68) at ./elf/dl-init.c:102
|
#14 0x00007fcba0bde2ea in _dl_start_user () from /lib64/ld-linux-x86-64.so.2
|
#15 0x0000000000000003 in ?? ()
|
#16 0x00007ffdb51634b8 in ?? ()
|
#17 0x00007ffdb51634c8 in ?? ()
|
#18 0x00007ffdb51634d1 in ?? ()
|
#19 0x0000000000000000 in ?? ()
|
|
|
Hey amit.gupta@opensense.com, could you try running the following to see if it lets you get past the symbolizer error:
sudo ln -s /usr/bin/llvm-symbolizer-14 /usr/bin/llvm-symbolizer
|
I think the problem might be that LLVM is looking for a binary named exactly "llvm-symbolizer", but Ubuntu installs the packages with a version suffix.
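A quick way to verify the result after creating the symlink, as a sketch:
# Sketch: confirm a working symbolizer resolves at the unsuffixed path.
ls -al /usr/bin/llvm-symbolizer
/usr/bin/llvm-symbolizer --version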
|
|
Going through the steps to install the ASAN build, I've run into the following issues:
- I needed to create a "mongod" subdirectory in /mongodb/var/logs/ (FIXED)
- When trying to start, I get this error in asan.log.158195:
- ==158195==ERROR: External symbolizer path is set to '/usr/bin/llvm-symbolizer-14' which isn't a known symbolizer. Please set the path to the llvm-symbolizer binary or other known tool.
Here is the ls -al of the file:
ls -al /usr/bin/llvm-symbolizer-14
lrwxrwxrwx 1 root root 34 Mar 24 2022 /usr/bin/llvm-symbolizer-14 -> ../lib/llvm-14/bin/llvm-symbolizer
any tips on how to get by the symbolizer error?
|
|
Hi all, so we ran into the issue again after about 3 weeks of uptime. I am going to install the debug version now on one of the nodes (we hadn't done so already because we wanted to see if debian 11 helped at all). Here is a brief outline of what happened:
- The crash happened on 3 different nodes (pndb3, ashdb3 and hildb3). (we have a 5 node replicaset)
- pndb3 and ashdb3 are the same nodes that crashed last time. They have identical hardware in 2 different datacenters. They both crashed at almost the same time last time. This time, they also became unresponsive around the same time.
- The processes look like they were running, but I could not login to the dbs with mongosh.
- strace -p showed nothing except for this one line (see the gdb sketch at the end of this comment for capturing full thread stacks next time):
- futex(0x55598a8b1828, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, NULL, FUTEX_BITSET_MATCH_ANY
- I tried systemctl stop mongod but the process just hung for a while. Eventually it looks like the processes were killed by systemd
- I tried deleting the data directories and resyncing, but they would start to sync then fail soon after.
- Around this time hildb3 (the third node) crashed. Now the shard in the replicaset doesn't have a majority.
- I tried again to delete the data directory and resync pndb3, and this time it worked. Then I resynced ashdb3 and hildb3, and they are all in sync.
What we learned:
- Debian 11 didn't make a difference (on ashdb3)
- The hardware commonality between pndb3 and ashdb3 isn't necessarily a clue because hildb3 is different.
- pndb3 and hildb3 are both running Ubuntu 22.04. However, pndb3 is running the latest HWE kernel (5.19) and hildb3 is running the standard kernel, so the kernel version didn't really matter either.
Now I'm focusing on installing the debug build you referenced above on pndb3. It seems like it takes a couple weeks for the problem to show up.
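For the next hang, a sketch for capturing every thread's stack from the stuck process, which would show more than the single futex line from strace (assumes gdb is installed and mongod is still running):
# attach non-destructively and dump all thread backtraces to a file
sudo gdb -p "$(pidof mongod)" -batch -ex 'thread apply all bt' > mongod-threads.txt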
|
|
Thanks Chris, we are about to downgrade our entire DB infrastructure to Debian 11 from Ubuntu 22.04. So I'm going to have the team slot this debug build in as soon as possible.
Regarding your clarification questions:
- By full sync, I mean we deleted the /var/lib/mongo directory contents and restarted mongod. It then does a sync from another member of the replica set. We do not manually copy the files over from any other server.
- We've been having issues for so long that it's hard to describe the full history. Also keep in mind that some of the crashes were due to 6.0.4, we believe. Things have gotten a lot more stable on the mongod and mongos sides since we upgraded.
- On the mongod crash side, it looks like we had a couple different types with 6.0.4. Either it would crash in a way that could be restarted without a resync, or a crash would happen where the wiredtiger engine complained about corruption.
- If the WT engine complained about corruption, we would either delete the /var/lib/mongo directory and restart, or first do a rs.remove() on the node, then do an rs.add(). We had run into an issue where just deleting the directory and then restarting would sync most of the data but then crash near the end of the sync when indexes were getting built (again, this was with 6.0.4; we have not experienced that with 6.0.5).
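As a concrete sketch of that remove/wipe/re-add flow (hostname and dbpath are placeholders; the rs.* helpers run against the primary):
# on the primary: drop the broken member from the replica set config
mongosh --eval 'rs.remove("brokenhost:27017")'
# on the broken member: stop mongod and wipe the data directory
sudo systemctl stop mongod
sudo rm -rf /var/lib/mongo/*
sudo systemctl start mongod
# on the primary: add it back; the member then performs a logical initial sync
mongosh --eval 'rs.add("brokenhost:27017")'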
|
|
Some updates:
Custom Build:
The custom build is available for download. This should provide valuable information and help us understand any future occurrences of the issue(s). Some guidance on using this build:
1) Please continue to make sure you have core-dumps enabled.
2) Make sure the directory /mongodb/var/logs/ exists (make it if not), and that you have write permissions to it.
3) Ensure you have llvm-symbolizer installed on the relevant instances. Internally, this involved running the following which resulted in having a version of llvm-symbolizer in /usr/bin:
sudo apt install clang lldb lld
|
4) Set the following environment variables (you will need to reset them if your system restarts; a sketch for persisting them via a systemd drop-in follows step 5)
export ASAN_OPTIONS="detect_leaks=0:use_sigaltstack=0:check_initialization_order=true:strict_init_order=true:abort_on_error=1:disable_coredump=0:handle_abort=1:unmap_shadow_on_exit=1:log_path=/mongodb/var/logs/mongod/asan.log"
|
Make sure to substitute your version:
export ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer<YOUR_VERSION>
|
5) Install the build from here (if you haven't already):
https://mciuploads.s3.amazonaws.com/mongodb-mongo-master/ubuntu2204-asan/c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d/dist/mongo-mongodb_mongo_master_ubuntu2204_asan_patch_c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d_641344592fbabe080b19cb03_23_03_16_16_31_49.tgz
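Since the variables from step 4 won't survive a reboot, one way to persist them (a sketch, assuming mongod runs under the stock mongod.service systemd unit):
sudo systemctl edit mongod
# in the editor that opens, add:
#   [Service]
#   Environment="ASAN_OPTIONS=<the value from step 4>"
#   Environment="ASAN_SYMBOLIZER_PATH=/usr/bin/llvm-symbolizer-14"
sudo systemctl restart mongod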
Data files:
In order to further investigate and diagnose the problem, we also kindly request that you provide the following files from your MongoDB database directory:
- WiredTiger.turtle
- WiredTiger.wt
Additionally, please include any files referenced in the error messages you have encountered (labelling which node they came from). This will help us better understand the root cause and identify possible solutions.
Once you have gathered the requested files, you can upload them to the same upload portal.
Clarifications:
Could you also please clarify the following:
- When you say "full sync", do you mean this logical initial sync procedure? If not, can you clarify this process?
- Can you go into some detail about the history of these nodes, such as when these nodes were synced in this way? (The more information, the better!)
Thank you for your cooperation, and we look forward to resolving this issue together!
|
|
Thanks for the additional information amit.gupta@opensense.com.
One of our mechanisms for doing a sync is based on copying the underlying files, which is part of what I was getting at in asking about the provenance of the affected files. I don't know exactly which release(s) support file-based sync or whether it would have been used here.
When was the last time either of the nodes that experienced this failure were synced? Is there any chance one was synced from the other?
It looks like the data from the other failure, the mongocrash-after-6.0.5.tar.gz file, didn't get uploaded successfully – at least I don't see it on my end. But it's good to know that they were on two separate nodes.
|
|
Hi, apologies for the delayed response. I'm traveling this week at a conference.
I think maybe we should open a separate ticket for the recent issue where 2 nodes in the replica set crashed around the same time. This ticket was originally opened for a bunch of issues, including server crashes as well as mongos crashes. After updating to 6.0.5, the mongos issues seem to have gone away. Also, we thought the server crashes went away because we didn't see one in almost 2 weeks. But then we got both the crashes on Friday.
In any case, here is some more information:
- There are 2 servers that crashed within seconds of each other. DB1 and DB2 (out of a 5 node replica set. DB1-5).
- Both DB1 and DB2 nodes were secondaries at the time they crashed.
- The only method we use to sync data to the DB is via a full sync. We have never seeded data by copying from another system.
- The problem collection is heavily used all the time. Also, would the problem have been detected during a full data sync? If so, it may not be a problem with bad data getting synced over.
- I unfortunately do not have the file anymore that was complained about in the crash. If/when it happens again, I will make sure to make a backup of the file that it complained about.
Keith, you mentioned you only saw one crash in the mongocrash-node-2-after-6.0.5.tar.gz. In the other uploaded file, mongocrash-after-6.0.5.tar.gz, there should be a mongod.log snippet with this:
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1680288909,"ts_usec":791377,"thread":"3764:0x7fbff99d6640","session_dhandle_name":"file:collection-18--1935767879803791750.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__posix_file_read:435:/var/lib/mongo/collection-18--1935767879803791750.wt: handle-read: pread: failed to read 4096 bytes at offset 110682112","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
|
This is indeed from a second system (as you alluded to by the GUID comment).
|
|
I'm a little confused by the uploaded information and the core files. Looking through the uploads on the Amphora link, I only see one mongod failure that includes log messages from the time of the most recent failure (March 31). This is mongocrash-node-2-after-6.0.5.tar.gz. It includes the following WiredTiger failure:
{"t":{"$date":"2023-03-31T14:55:02.542-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1680288902,"ts_usec":542695,"thread":"5194:0x7fda0d093640","session_dhandle_name":"file:collection-18-6164459054593005979.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__posix_file_read:435:/var/lib/mongo/collection-18-6164459054593005979.wt: handle-read: pread: failed to read 4096 bytes at offset 109453312","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
|
{"t":{"$date":"2023-03-31T14:55:02.542-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":22,"message":{"ts_sec":1680288902,"ts_usec":542756,"thread":"5194:0x7fda0d093640","session_dhandle_name":"file:collection-2-6164459054593005979.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_checkpoint_resolve:929:collection-2-6164459054593005979.wt: the checkpoint failed, the system must restart","error_str":"Invalid argument","error_code":22}}}
|
{"t":{"$date":"2023-03-31T14:55:02.542-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31804,"message":{"ts_sec":1680288902,"ts_usec":542779,"thread":"5194:0x7fda0d093640","session_dhandle_name":"file:collection-2-6164459054593005979.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_checkpoint_resolve:929:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804}}}
|
This is the same sequence of errors amit.gupta@opensense.com quoted above, except 7 seconds earlier and the collection files (18 and 2) have a different GUID, suggesting this is a different system. Also the problem offset (109453312) is a bit earlier in the file.
So it looks like there is a similar problem with the database on two different nodes.
I think it would be useful to understand the provenance of the files here. Is there a common history – initial sync, backup/restore, etc – that might account for them having similar problems? Is there a chance the file(s) might have been copied incompletely due to some failure?
Also, it sounds like this cluster had one set of problems about a week before this new issue manifested? For example, have the problem collections been unused (or lightly used) since the earlier problem? Or, more generally, is there an access pattern that would have allowed a problem introduced in the earlier failure to remain unseen until March 31?
|
|
I can't comment on the mongos failure that initially triggered this ticket; mongos doesn't use WiredTiger, so whatever is happening there is unlikely to be a Storage Engines issue.
In the more recent failures that amit.gupta@opensense.com reported a couple comments up, it looks like the problem was triggered by this error:
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1680288909,"ts_usec":791377,"thread":"3764:0x7fbff99d6640","session_dhandle_name":"file:collection-18--1935767879803791750.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__posix_file_read:435:/var/lib/mongo/collection-18--1935767879803791750.wt: handle-read: pread: failed to read 4096 bytes at offset 110682112","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
|
WiredTiger failed trying to read 4K of data at offset 110682112 in file /var/lib/mongo/collection-18--1935767879803791750.wt. The generic WT error, WT_ERROR, indicates the read was past the end of the file and returned zero bytes. WiredTiger should not be trying to read past the end of the file. This suggests a couple of possibilities:
- The file has been corrupted somehow.
- Since we've been discussing memory corruption, possibly some internal state got overwritten leading to WT trying to read at a bogus address.
amit.gupta@opensense.com said that two nodes failed at the same time. If they both failed in this manner, I think that points to option #1. It seems unlikely to me that memory corruption would generate exactly the same type of failure on two separate nodes. But that's just my gut feeling.
So to confirm my reading of the failure, it would be useful to know if /var/lib/mongo/collection-18--1935767879803791750.wt is smaller than 110682112 bytes.
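For example (a sketch; the path is taken from the error above), comparing the file size against the end of the failed read, 110682112 + 4096 = 110686208 bytes:
# print the file size in bytes
stat -c %s /var/lib/mongo/collection-18--1935767879803791750.wt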
We did have a recent bug that had the potential to generate this error after upgrade – introduced by WT-5927 and fixed by WT-9457. But that happened in 6.1.x, and shouldn't be in play here.
I'll take a look at the core files etc. and see what jumps out at me.
|
|
I'm assigning to keith.smith@mongodb.com. It's unclear whether the most recent crash is the same problem. This stack trace seems to correspond a lot more closely to the storage end of things.
|
|
Thanks Amit, I'm taking a look today.
|
|
Here is the error snippet from the logs
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31802,"message":{"ts_sec":1680288909,"ts_usec":791377,"thread":"3764:0x7fbff99d6640","session_dhandle_name":"file:collection-18--1935767879803791750.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__posix_file_read:435:/var/lib/mongo/collection-18--1935767879803791750.wt: handle-read: pread: failed to read 4096 bytes at offset 110682112","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802}}}
|
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":22,"message":{"ts_sec":1680288909,"ts_usec":791461,"thread":"3764:0x7fbff99d6640","session_dhandle_name":"file:collection-2--1935767879803791750.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_checkpoint_resolve:929:collection-2--1935767879803791750.wt: the checkpoint failed, the system must restart","error_str":"Invalid argument","error_code":22}}}
|
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"E", "c":"WT", "id":22435, "ctx":"Checkpointer","msg":"WiredTiger error message","attr":{"error":-31804,"message":{"ts_sec":1680288909,"ts_usec":791481,"thread":"3764:0x7fbff99d6640","session_dhandle_name":"file:collection-2--1935767879803791750.wt","session_name":"WT_SESSION.checkpoint","category":"WT_VERB_DEFAULT","category_id":9,"verbose_level":"ERROR","verbose_level_id":-3,"msg":"__wt_block_checkpoint_resolve:929:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804}}}
|
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"F", "c":"ASSERT", "id":23089, "ctx":"Checkpointer","msg":"Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":712}}
|
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"F", "c":"ASSERT", "id":23090, "ctx":"Checkpointer","msg":"\n\n***aborting after fassert() failure\n\n"}
|
{"t":{"$date":"2023-03-31T14:55:09.791-04:00"},"s":"F", "c":"CONTROL", "id":6384300, "ctx":"Checkpointer","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31380, "ctx":"Checkpointer","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"55BDDBB536A4","b":"55BDD6D41000","o":"4E126A4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"},{"a":"55BDDBB55BE9","b":"55BDD6D41000","o":"4E14BE9","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"},{"a":"55BDDBB4F8C6","b":"55BDD6D41000","o":"4E0E8C6","s":"abruptQuit","s+":"66"},{"a":"7FC002C39520","b":"7FC002BF7000","o":"42520","s":"__sigaction","s+":"50"},{"a":"7FC002C8DA7C","b":"7FC002BF7000","o":"96A7C","s":"pthread_kill","s+":"12C"},{"a":"7FC002C39476","b":"7FC002BF7000","o":"42476","s":"raise","s+":"16"},{"a":"7FC002C1F7F3","b":"7FC002BF7000","o":"287F3","s":"abort","s+":"D3"},{"a":"55BDD8C45D62","b":"55BDD6D41000","o":"1F04D62","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","C":"mongo::fassertFailedWithLocation(int, char const*, unsigned int)","s+":"F6"},{"a":"55BDD87F6990","b":"55BDD6D41000","o":"1AB5990","s":"_ZN5mongo12_GLOBAL__N_141mdb_handle_error_with_startup_suppressionEP18__wt_event_handlerP12__wt_sessioniPKc.cold.1418","C":"mongo::(anonymous namespace)::mdb_handle_error_with_startup_suppression(__wt_event_handler*, __wt_session*, int, char const*) [clone .cold.1418]","s+":"16"},{"a":"55BDD97BAC81","b":"55BDD6D41000","o":"2A79C81","s":"__eventv","s+":"E61"},{"a":"55BDD880ABC7","b":"55BDD6D41000","o":"1AC9BC7","s":"__wt_panic_func","s+":"13A"},{"a":"55BDD881B83A","b":"55BDD6D41000","o":"1ADA83A","s":"__wt_block_checkpoint_resolve.cold.7","s+":"2B"},{"a":"55BDD977E472","b":"55BDD6D41000","o":"2A3D472","s":"__wt_meta_track_off","s+":"682"},{"a":"55BDD97DAAF5","b":"55BDD6D41000","o":"2A99AF5","s":"__txn_checkpoint_wrapper","s+":"1475"},{"a":"55BDD97DB9B6","b":"55BDD6D41000","o":"2A9A9B6","s":"__wt_txn_checkpoint","s+":"246"},{"a":"55BDD8809D72","b":"55BDD6D41000","o":"1AC8D72","s":"__session_checkpoint.cold.48","s+":"10A"},{"a":"55BDD96E0D74","b":"55BDD6D41000","o":"299FD74","s":"_ZN5mongo18WiredTigerKVEngine11_checkpointEP12__wt_session","C":"mongo::WiredTigerKVEngine::_checkpoint(__wt_session*)","s+":"174"},{"a":"55BDD96E3230","b":"55BDD6D41000","o":"29A2230","s":"_ZN5mongo18WiredTigerKVEngine10checkpointEv","C":"mongo::WiredTigerKVEngine::checkpoint()","s+":"40"},{"a":"55BDD9F26D74","b":"55BDD6D41000","o":"31E5D74","s":"_ZN5mongo12Checkpointer3runEv","C":"mongo::Checkpointer::run()","s+":"DD4"},{"a":"55BDDB934B16","b":"55BDD6D41000","o":"4BF3B16","s":"_ZN5mongo13BackgroundJob7jobBodyEv","C":"mongo::BackgroundJob::jobBody()","s+":"E6"},{"a":"55BDDB93519C","b":"55BDD6D41000","o":"4BF419C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_13BackgroundJob2goEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","C":"std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::BackgroundJob::go()::{lambda()#2}, , 0>(mongo::BackgroundJob::go()::{lambda()#2})::{lambda()#1}> > 
>::_M_run()","s+":"7C"},{"a":"55BDDBCF6C5F","b":"55BDD6D41000","o":"4FB5C5F","s":"execute_native_thread_routine","s+":"F"},{"a":"7FC002C8BB43","b":"7FC002BF7000","o":"94B43","s":"pthread_condattr_setpshared","s+":"513"},{"a":"7FC002D1DA00","b":"7FC002BF7000","o":"126A00","s":"__xmknodat","s+":"230"}],"processInfo":{"mongodbVersion":"6.0.5","gitVersion":"c9a99c120371d4d4c52cbb15dac34a36ce8d3b1d","compiledModules":[],"uname":{"sysname":"Linux","release":"5.15.0-67-generic","version":"#74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023","machine":"x86_64"},"somap":[{"b":"55BDD6D41000","elfType":3,"buildId":"3C976B679319B912CD7F7AC36B539DCD7F937343"},{"b":"7FC002BF7000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"69389D485A9793DBE873F0EA2C93E02EFAA9AA3D"}]}}},"tags":[]}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"55BDDBB536A4","b":"55BDD6D41000","o":"4E126A4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"55BDDBB55BE9","b":"55BDD6D41000","o":"4E14BE9","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"}}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"55BDDBB4F8C6","b":"55BDD6D41000","o":"4E0E8C6","s":"abruptQuit","s+":"66"}}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"7FC002C39520","b":"7FC002BF7000","o":"42520","s":"__sigaction","s+":"50"}}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"7FC002C8DA7C","b":"7FC002BF7000","o":"96A7C","s":"pthread_kill","s+":"12C"}}}
|
{"t":{"$date":"2023-03-31T14:55:09.876-04:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"Checkpointer","msg":"Frame","attr":{"frame":{"a":"7FC002C39476","b":"7FC002BF7000","o":"42476","s":"raise","s+":"16"}}}
|
|
|
The interesting thing is the 2 servers are in 2 different datacenters. They both look like they crashed at almost the exact same time.
|
|
I've uploaded the diagnostics, coredumps and the mongod.log files for both servers.
|
|
ugh.. unfortunately, we just had 2 nodes in our 5-node replica set crash at almost the same time. Will be uploading data.
|
|
Hi Amit, it's still unclear as we are not able to reproduce this in 6.0.4. Looking at the commit logs, it's pretty hard to say. Plenty of commits went in, and given the nature of the crash appearing to be some sort of memory corruption, the root cause is hard to diagnose.
I am keeping this one open until we wrap up some conversations on our end if that's ok with you. We don't want to ignore that this happened.
|
|
Thanks Matt. Do you have any guesses on which bug fix(es) in 6.0.5 may have addressed our crashes? Just based on the logs and crash info. Just curious. I think we can close out this ticket as resolved based on 6.0.5.
|
|
Thanks for the update, Amit! I'm glad to hear all is working well!
|
|
quick update.. since the 6.0.5 patch, the crashes have not happened again. I don't want to jinx it, but it seems resolved.
|
|
Thanks for the update. That's really good to hear! It's important that your issue gets resolved. Definitely keep monitoring and send an update if anything comes up. If you get back to us early next week, we'll definitely be able to follow up.
We're going to follow up internally on trying to understand why this happens in 6.0.4.
|
|
We haven't seen a recurrence of any mongod or mongos issues since we have updated to 6.0.5. However, we are still keeping a close eye since we have gone for a day or 2 in the past without issue. If there is no issue going through early next week, it would seem that 6.0.5 resolved the problem we were having.
|
|
Hopefully you won't need to compile it. Using our build infrastructure is more likely to generate something that's in line with a production build. I suspect it will be doable though if it comes down to it, and your willingness to help in that way is very appreciated! I'll keep you posted.
|
|
Thanks! If it's easier to provide a git/source branch for the instrumented build rather than a compiled version, then we can compile it on the servers.
|
|
We'll definitely continue to try to open any core dumps that you send over. Unfortunately I've only been able to open one of the core dumps so far, which is why instrumentation might end up being helpful.
There are some logistics that we will need to figure out internally before we could distribute an instrumented build. If it's not possible, I'll pursue other ways for us to dig deeper into the root cause.
|
|
Thanks Matt, I got your note about installing the custom build with more instrumentation. The speed slowdown should not be an issue. Here is how we are planning our next steps:
1) Allow 6.0.5 to run. If we don't see a recurrence of the issue, then great!
2) If we see a recurrence of the issue, then for every server that has a crash, we will send you the current coredump and then install your instrumented version on that node to wait for it to happen again.
(For some reason I do not see your note about how to set up the instrumented version in the comment chain above.) I have it in my email though.
|
|
Hi Amit,
Thanks for keeping me updated. It'll be interesting to see how 6.0.5 goes. We're continuing to investigate on our end and coordinate how we can move forward on diagnosing this issue with you. I'll make sure to update you as soon as I have more info.
|
|
Hi Matt, we have upgraded all mongod and mongos instances to 6.0.5 and restarted them all.. Let's see how the next couple days work out. If you need us to run a custom branch with more instrumentation, we are happy to do so!
|
|
Sounds good. Happy to do whatever. We can compile from source as well. Also happy to hop on a screen share if you think it will be helpful.
|
|
Thanks for the update, and good to know. So far, it seems every crash is giving a different backtrace. This suggests to me it's something that will be difficult to diagnose without some added instrumentation. We are currently coming up with some suggestions for gathering more diagnostics.
|
|
Thank you. Any node that has produced a core dump, we are upgrading to 6.0.5 now.. later tonight we plan to upgrade everything to 6.0.5 just so we can start fresh tomorrow with the latest patched version.
|
|
Thanks amit.gupta@opensense.com. I'll look into this one too. Will make sure to get back to you if I have more info or recommendations.
|
|
another core dump on another mongod server.. uploading crash4-daldb1.zip
|
|
just had another secondary in the replica set core dump. Uploaded log, coredump and diagnostics as crash3-daldb2.zip
|
|
Thanks Matt.. does the fact that we are using systemd-coredump have any bearing on whether you would be able to read the coredump? I'm not sure if it changes something in the formatting. I have almost no experience with looking at core dumps, so apologies if it's a dumb question.
|
|
Just saw your comment about mongod. Haven't checked that dump yet. I'm on it now.
|
|
I took a look at the core dumps. I was able to open the larger of the two and it doesn't correspond to the crash in question. GDB can't make sense of the smaller one (assuming it came from an Ubuntu 22.04 x86_64 instance running mongo community).
amit.gupta@opensense.com were there more dumps? Any that you know to be associated with the invariant in question?
|
|
had a mongod crash in a replica set just now. Will upload the core dump, mongod.log and diagnostics now.
|
|
Thanks Matt. Just an FYI, we made a couple updates to help reduce the chances of hitting these segfaults.
- We optimized a certain process that was doing a ton of updateOne operations. It should be doing about 99% fewer write operations.
- We have a 5-node replica set for the config server. One of the secondary nodes for the config RS was running on the same server as the mongos that crashed. (It's a pretty fast dual-socket Intel(R) Xeon(R) Platinum 8260Y machine with 96 threads.) We moved the config RS member off of the server. We are just shooting in the dark in case there might be some race condition triggered because the config secondary is running on the same server as the mongos.
|
|
Just getting an update out: I'm getting a spawnhost set up to look at the dumps. I've seen this invariant before actually.
In the previous case it was due to a Future with an incorrect SharedState pointer, which was caused by an incorrect capture by reference.
I will know more once I can get the dumps opened up.
|
- Other notes: these core dumps happened around the same time.
- We are at FCV 6.0 throughout the cluster.
This is just speculation, but I feel like the crashes happen slightly more on servers that have dual sockets. It seems like some sort of race condition may be getting triggered.
|
|
I was looking through some other servers to see if there are other coredump files. I found 2 more and uploaded them.
One is for mongos and another for mongosh. Both of these dumps are from different servers than the first mongos dump provided.
|
|
I found the coredump file in /var/lib/systemd/coredump/core.mongos.125.55df9f5d165e41be91c3118727fd24c1.171812.1678478198000000.zst
I've zipped and uploaded it to the upload link provided before. Thanks!
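For anyone else pulling a core out of systemd-coredump storage, a sketch (the PID 171812 is taken from the file name above):
# list captured crashes for mongos, then write the core out to a plain file
coredumpctl list mongos
coredumpctl dump 171812 --output=core.mongos.171812
# the on-disk .zst copy can also be decompressed directly
zstd -d /var/lib/systemd/coredump/core.mongos.125.*.zst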
|
|
I’ll check ulimit settings when I’m in the office. Thanks for helping out on this!
|
|
Hi Jason, I did not see any core dump file in /var/crash. I’m not sure why not though. Are you aware of any setting I should enable to make core dumps get produced?
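The usual knobs to check for missing cores (a sketch; a piped core_pattern means a handler such as apport or systemd-coredump captures cores instead of writing them to a directory like /var/crash):
# per-process core size limit; 0 means core dumps are disabled
ulimit -c
# where the kernel delivers cores; a leading "|" pipes them to a handler
cat /proc/sys/kernel/core_pattern
# for a systemd-managed mongod, the unit's limit matters too, e.g. a drop-in with:
#   [Service]
#   LimitCORE=infinity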
|
|
hi amit.gupta@opensense.com, thanks for the logs and diagnostic data. Would you have core dumps available or be able to capture them for any future crashes? If so, chris.kelly@mongodb.com, would you be able to create another secure portal for Amit to upload that data?
|
|
Another crash of mongos router today:
{"t":{"$date":"2023-03-10T14:56:38.042-05:00"},"s":"I", "c":"CONTROL", "id":31380, "ctx":"conn19293","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"5558A0239214","b":"55589CB2A000","o":"370F214","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"},{"a":"5558A023B759","b":"55589CB2A000","o":"3711759","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"},{"a":"5558A023558C","b":"55589CB2A000","o":"370B58C","s":"abruptQuitWithAddrSignal","s+":"EC"},{"a":"7FABECB10520","b":"7FABECACE000","o":"42520","s":"__sigaction","s+":"50"},{"a":"5558A008BB10","b":"55589CB2A000","o":"3561B10","s":"_ZNK5mongo16OperationContext18hasDeadlineExpiredEv","C":"mongo::OperationContext::hasDeadlineExpired() const","s+":"B0"},{"a":"5558A008D3FF","b":"55589CB2A000","o":"35633FF","s":"_ZN5mongo16OperationContext25checkForInterruptNoAssertEv","C":"mongo::OperationContext::checkForInterruptNoAssert()","s+":"7F"},{"a":"55589E449440","b":"55589CB2A000","o":"191F440","s":"_ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlNS0_9WakeSpeedEE0_clESG_","s+":"60"},{"a":"55589E44962D","b":"55589CB2A000","o":"191F62D","s":"_ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlSE_NS0_9WakeSpeedEE1_clESE_SG_","s+":"BD"},{"a":"55589E465FC5","b":"55589CB2A000","o":"193BFC5","s":"_ZN5mongo14future_details15SharedStateBase4waitEPNS_13InterruptibleE","C":"mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)","s+":"175"},{"a":"55589E5BEAD8","b":"55589CB2A000","o":"1A94AD8","s":"_ZNK5mongo25MongosTopologyCoordinator18awaitHelloResponseEPNS_16OperationContextEN5boost8optionalINS_15TopologyVersionEEENS4_INS_6Date_tEEE","C":"mongo::MongosTopologyCoordinator::awaitHelloResponse(mongo::OperationContext*, boost::optional<mongo::TopologyVersion>, boost::optional<mongo::Date_t>) const","s+":"2F8"},{"a":"55589E519850","b":"55589CB2A000","o":"19EF850","s":"_ZN5mongo12_GLOBAL__N_18CmdHello19runWithReplyBuilderEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::(anonymous namespace)::CmdHello::runWithReplyBuilder(mongo::OperationContext*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, mongo::BSONObj const&, mongo::rpc::ReplyBuilderInterface*)","s+":"200"},{"a":"55589F9ADA7E","b":"55589CB2A000","o":"2E83A7E","s":"_ZN5mongo37BasicCommandWithReplyBuilderInterface10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::BasicCommandWithReplyBuilderInterface::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)","s+":"BE"},{"a":"55589F9A54C1","b":"55589CB2A000","o":"2E7B4C1","s":"_ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::CommandHelpers::runCommandInvocation(mongo::OperationContext*, mongo::OpMsgRequest const&, mongo::CommandInvocation*, 
mongo::rpc::ReplyBuilderInterface*)","s+":"61"},{"a":"55589F9A9539","b":"55589CB2A000","o":"2E7F539","s":"_ZN5mongo14CommandHelpers20runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEENS_9transport15ServiceExecutor14ThreadingModelE","C":"mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, mongo::transport::ServiceExecutor::ThreadingModel)","s+":"1C9"},{"a":"55589E57BDC7","b":"55589CB2A000","o":"1A51DC7","s":"_ZN5mongo12_GLOBAL__N_120runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEE","C":"mongo::(anonymous namespace)::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>)","s+":"97"},{"a":"55589E580335","b":"55589CB2A000","o":"1A56335","s":"_ZN5mongo12_GLOBAL__N_117ExecCommandClient3runEv","C":"mongo::(anonymous namespace)::ExecCommandClient::run()","s+":"605"},{"a":"55589E580B6E","b":"55589CB2A000","o":"1A56B6E","s":"_ZN5mongo12_GLOBAL__N_118ParseAndRunCommand11RunAndRetry3runEv","C":"mongo::(anonymous namespace)::ParseAndRunCommand::RunAndRetry::run()","s+":"3EE"},{"a":"55589E584E3C","b":"55589CB2A000","o":"1A5AE3C","s":"_ZN5mongo13ClientCommand8_executeEv","C":"mongo::ClientCommand::_execute()","s+":"3BC"},{"a":"55589E586371","b":"55589CB2A000","o":"1A5C371","s":"_ZN5mongo13ClientCommand3runEv","C":"mongo::ClientCommand::run()","s+":"41"},{"a":"55589E586A4C","b":"55589CB2A000","o":"1A5CA4C","s":"_ZN5mongo19makeReadyFutureWithIZNOS_11future_util10AsyncStateINS_13ClientCommandEE13thenWithStateIZNS_8Strategy13clientCommandESt10shared_ptrINS_23RequestExecutionContextEEEUlPT_E_EEDaOSA_EUlvE_Li0EEENS_6FutureINS_14future_details17UnwrappedTypeImplINSt13invoke_resultISD_JEE4typeEE4typeEEESD_","C":"mongo::Future<mongo::future_details::UnwrappedTypeImpl<std::invoke_result<mongo::Strategy::clientCommand(std::shared_ptr<mongo::RequestExecutionContext>)::{lambda(auto:1*)#1}>::type>::type> mongo::makeReadyFutureWith<mongo::future_util::AsyncState<mongo::ClientCommand>::thenWithState<mongo::Strategy::clientCommand(std::shared_ptr<mongo::RequestExecutionContext>)::{lambda(auto:1*)#1}>(mongo::Strategy::clientCommand(std::shared_ptr<mongo::RequestExecutionContext>)::{lambda(auto:1*)#1}&&) &&::{lambda()#1}, 0>(mongo::Strategy::clientCommand(std::shared_ptr<mongo::RequestExecutionContext>)::{lambda(auto:1*)#1})","s+":"3C"},{"a":"55589E587252","b":"55589CB2A000","o":"1A5D252","s":"_ZN5mongo8Strategy13clientCommandESt10shared_ptrINS_23RequestExecutionContextEE","C":"mongo::Strategy::clientCommand(std::shared_ptr<mongo::RequestExecutionContext>)","s+":"192"},{"a":"55589E498011","b":"55589CB2A000","o":"196E011","s":"_ZN5mongo13HandleRequest13handleRequestEv","C":"mongo::HandleRequest::handleRequest()","s+":"111"},{"a":"55589E498538","b":"55589CB2A000","o":"196E538","s":"_ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationINS_10DbResponseEZZNOS9_4thenIZNS_13HandleRequest3runEvEUlvE0_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EEPNSI_ISB_EEE_EENS7_ISF_EEOT0_EUlS3_E_EEDaSG_EN12SpecificImpl4callEOS3_","C":"mongo::unique_function<void (mongo::future_details::SharedStateBase*)>::makeImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<mongo::DbResponse, 
mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::HandleRequest::run()::{lambda()#2}>(mongo::HandleRequest::run()::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*, {lambda()#3}<mongo::DbResponse>*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::HandleRequest::run()::{lambda()#2}>(mongo::HandleRequest::run()::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*, {lambda()#3}<mongo::DbResponse>*)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<mongo::DbResponse, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::HandleRequest::run()::{lambda()#2}>(mongo::HandleRequest::run()::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*, {lambda()#3}<mongo::DbResponse>*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::HandleRequest::run()::{lambda()#2}>(mongo::HandleRequest::run()::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*, {lambda()#3}<mongo::DbResponse>*)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1})::SpecificImpl::call(mongo::future_details::SharedStateBase*&&)","s+":"88"},{"a":"55589E467BB7","b":"55589CB2A000","o":"193DBB7","s":"_ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","C":"mongo::future_details::SharedStateBase::transitionToFinished()","s+":"1B7"},{"a":"55589E467BB7","b":"55589CB2A000","o":"193DBB7","s":"_ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","C":"mongo::future_details::SharedStateBase::transitionToFinished()","s+":"1B7"},{"a":"55589E498E9F","b":"55589CB2A000","o":"196EE9F","s":"_ZN5mongo13HandleRequest3runEv","C":"mongo::HandleRequest::run()","s+":"3CF"},{"a":"55589E499D09","b":"55589CB2A000","o":"196FD09","s":"_ZN5mongo23ServiceEntryPointMongos17handleRequestImplEPNS_16OperationContextERKNS_7MessageE","C":"mongo::ServiceEntryPointMongos::handleRequestImpl(mongo::OperationContext*, mongo::Message const&)","s+":"1F9"},{"a":"55589E499F16","b":"55589CB2A000","o":"196FF16","s":"_ZN5mongo23ServiceEntryPointMongos13handleRequestEPNS_16OperationContextERKNS_7MessageE","C":"mongo::ServiceEntryPointMongos::handleRequest(mongo::OperationContext*, mongo::Message const&)","s+":"26"},{"a":"55589E7CCC47","b":"55589CB2A000","o":"1CA2C47","s":"_ZN5mongo9transport19ServiceStateMachine4Impl14processMessageEv","C":"mongo::transport::ServiceStateMachine::Impl::processMessage()","s+":"127"},{"a":"55589E7CE956","b":"55589CB2A000","o":"1CA4956","s":"_ZN5mongo9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusE","C":"mongo::transport::ServiceStateMachine::Impl::startNewLoop(mongo::Status const&)","s+":"76"},{"a":"55589E7CEDEF","b":"55589CB2A000","o":"1CA4DEF","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_9transport19ServiceStateMachine4Impl15scheduleNewLoopES1_EUlS1_E_EEDaOT_EN12SpecificImpl4callEOS1_","C":"mongo::unique_function<void 
(mongo::Status)>::makeImpl<mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}>(mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}&&)::SpecificImpl::call(mongo::Status&&)","s+":"7F"},{"a":"55589FA9D393","b":"55589CB2A000","o":"2F73393","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport15ServiceExecutor8scheduleENS0_IFvNS_6StatusEEEEEUlvE_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}>(mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}&&)::SpecificImpl::call()","s+":"33"},{"a":"55589FAA368B","b":"55589CB2A000","o":"2F7968B","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport26ServiceExecutorSynchronous12scheduleTaskES2_NS4_15ServiceExecutor13ScheduleFlagsEEUlvE0_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::{lambda()#2}>(mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::{lambda()#2}&&)::SpecificImpl::call()","s+":"BB"},{"a":"55589FAA5610","b":"55589CB2A000","o":"2F7B610","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_25launchServiceWorkerThreadES2_EUlvE2_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::{lambda()#4}>(mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::{lambda()#4}&&)::SpecificImpl::call()","s+":"80"},{"a":"55589FAA56BC","b":"55589CB2A000","o":"2F7B6BC","s":"_ZN5mongo12_GLOBAL__N_17runFuncEPv","C":"mongo::(anonymous namespace)::runFunc(void*)","s+":"1C"},{"a":"7FABECB62B43","b":"7FABECACE000","o":"94B43","s":"pthread_condattr_setpshared","s+":"513"},{"a":"7FABECBF4A00","b":"7FABECACE000","o":"126A00","s":"__xmknodat","s+":"230"}],"processInfo":{"mongodbVersion":"6.0.4","gitVersion":"44ff59461c1353638a71e710f385a566bcd2f547","compiledModules":[],"uname":{"sysname":"Linux","release":"5.15.0-67-generic","version":"#74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023","machine":"x86_64"},"somap":[{"b":"55589CB2A000","elfType":3,"buildId":"2A25DCB1B8E1FE370722E05329549F78EE3622C8"},{"b":"7FABECACE000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"69389D485A9793DBE873F0EA2C93E02EFAA9AA3D"}]}}},"tags":[]}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"5558A0239214","b":"55589CB2A000","o":"370F214","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"5558A023B759","b":"55589CB2A000","o":"3711759","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"5558A023558C","b":"55589CB2A000","o":"370B58C","s":"abruptQuitWithAddrSignal","s+":"EC"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"7FABECB10520","b":"7FABECACE000","o":"42520","s":"__sigaction","s+":"50"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"5558A008BB10","b":"55589CB2A000","o":"3561B10","s":"_ZNK5mongo16OperationContext18hasDeadlineExpiredEv","C":"mongo::OperationContext::hasDeadlineExpired() const","s+":"B0"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"5558A008D3FF","b":"55589CB2A000","o":"35633FF","s":"_ZN5mongo16OperationContext25checkForInterruptNoAssertEv","C":"mongo::OperationContext::checkForInterruptNoAssert()","s+":"7F"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"55589E449440","b":"55589CB2A000","o":"191F440","s":"_ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlNS0_9WakeSpeedEE0_clESG_","s+":"60"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"55589E44962D","b":"55589CB2A000","o":"191F62D","s":"_ZZN5mongo13Interruptible32waitForConditionOrInterruptUntilISt11unique_lockISt5mutexEZNS_14future_details15SharedStateBase4waitEPS0_EUlvE_EEbRNS_4stdx18condition_variableERT_NS_6Date_tET0_ENKUlSE_NS0_9WakeSpeedEE1_clESE_SG_","s+":"BD"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"55589E465FC5","b":"55589CB2A000","o":"193BFC5","s":"_ZN5mongo14future_details15SharedStateBase4waitEPNS_13InterruptibleE","C":"mongo::future_details::SharedStateBase::wait(mongo::Interruptible*)","s+":"175"}}}
|
{"t":{"$date":"2023-03-10T14:56:38.043-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn19293","msg":"Frame","attr":{"frame":{"a":"55589E5BEAD8","b":"55589CB2A000","o":"1A94AD8","s":"_ZNK5mongo25MongosTopologyCoordinator18awaitHelloResponseEPNS_16OperationContextEN5boost8optionalINS_15TopologyVersionEEENS4_INS_6Date_tEEE","C":"mongo::MongosTopologyCoordinator::awaitHelloResponse(mongo::OperationContext*, boost::optional<mongo::TopologyVersion>, boost::optional<mongo::Date_t>) const","s+":"2F8"}}}
|
|
|
Just had a second crash of the same server. Uploaded logs and diagnostics to mongod-replicaset_member_shard1-second-crash.tar.gz
Here is the log snippet in case it's useful for SEO or someone else.
{"t":{"$date":"2023-03-09T15:08:59.312-05:00"},"s":"F", "c":"CONTROL", "id":6384300, "ctx":"TimestampMonitor","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"E", "c":"CONTROL", "id":31430, "ctx":"TimestampMonitor","msg":"Error collecting stack trace","attr":{"error":"unw_get_proc_name(7F2F7051D71B): no unwind info found\nunw_get_proc_name(7F2F7051D71B): no unwind info found\n"}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31380, "ctx":"TimestampMonitor","msg":"BACKTRACE","attr":{"bt":{"backtrace":[{"a":"55CAFE8668D4","b":"55CAF9A88000","o":"4DDE8D4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"},{"a":"55CAFE868E19","b":"55CAF9A88000","o":"4DE0E19","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"},{"a":"55CAFE862AF6","b":"55CAF9A88000","o":"4DDAAF6","s":"abruptQuit","s+":"66"},{"a":"7F2F70537520","b":"7F2F704F5000","o":"42520","s":"__sigaction","s+":"50"},{"a":"7F2F7058BA7C","b":"7F2F704F5000","o":"96A7C","s":"pthread_kill","s+":"12C"},{"a":"7F2F70537476","b":"7F2F704F5000","o":"42476","s":"raise","s+":"16"},{"a":"7F2F7051D7F3","b":"7F2F704F5000","o":"287F3","s":"abort","s+":"D3"},{"a":"7F2F7051D71B","b":"7F2F704F5000","o":"2871B"},{"a":"7F2F7052EE96","b":"7F2F704F5000","o":"39E96","s":"__assert_fail","s+":"46"},{"a":"7F2F7058C201","b":"7F2F704F5000","o":"97201","s":"pthread_kill_other_threads_np","s+":"741"},{"a":"7F2F70588E05","b":"7F2F704F5000","o":"93E05","s":"pthread_cond_timedwait","s+":"125"},{"a":"55CAFE6E3D50","b":"55CAF9A88000","o":"4C5BD50","s":"_ZN5mongo11ClockSource21waitForConditionUntilERNS_4stdx18condition_variableENS_20BasicLockableAdapterENS_6Date_tEPNS_8WaitableE","C":"mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*)","s+":"5F0"},{"a":"55CAFC241AB2","b":"55CAF9A88000","o":"27B9AB2","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_18PeriodicRunnerImpl15PeriodicJobImpl4_runEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","C":"std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::PeriodicRunnerImpl::PeriodicJobImpl::_run()::{lambda()#2}, , 0>(mongo::PeriodicRunnerImpl::PeriodicJobImpl::_run()::{lambda()#2})::{lambda()#1}> > >::_M_run()","s+":"782"},{"a":"55CAFEA09E8F","b":"55CAF9A88000","o":"4F81E8F","s":"execute_native_thread_routine","s+":"F"},{"a":"7F2F70589B43","b":"7F2F704F5000","o":"94B43","s":"pthread_condattr_setpshared","s+":"513"},{"a":"7F2F7061BA00","b":"7F2F704F5000","o":"126A00","s":"__xmknodat","s+":"230"}],"processInfo":{"mongodbVersion":"6.0.4","gitVersion":"44ff59461c1353638a71e710f385a566bcd2f547","compiledModules":[],"uname":{"sysname":"Linux","release":"5.15.0-67-generic","version":"#74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023","machine":"x86_64"},"somap":[{"b":"55CAF9A88000","elfType":3,"buildId":"BCB8F6C3AD4ABF41C81882BAE25C6885C6266CC4"},{"b":"7F2F704F5000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"69389D485A9793DBE873F0EA2C93E02EFAA9AA3D"}]}}},"tags":[]}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFE8668D4","b":"55CAF9A88000","o":"4DDE8D4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFE868E19","b":"55CAF9A88000","o":"4DE0E19","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFE862AF6","b":"55CAF9A88000","o":"4DDAAF6","s":"abruptQuit","s+":"66"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F70537520","b":"7F2F704F5000","o":"42520","s":"__sigaction","s+":"50"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7058BA7C","b":"7F2F704F5000","o":"96A7C","s":"pthread_kill","s+":"12C"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F70537476","b":"7F2F704F5000","o":"42476","s":"raise","s+":"16"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7051D7F3","b":"7F2F704F5000","o":"287F3","s":"abort","s+":"D3"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7051D71B","b":"7F2F704F5000","o":"2871B"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7052EE96","b":"7F2F704F5000","o":"39E96","s":"__assert_fail","s+":"46"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7058C201","b":"7F2F704F5000","o":"97201","s":"pthread_kill_other_threads_np","s+":"741"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F70588E05","b":"7F2F704F5000","o":"93E05","s":"pthread_cond_timedwait","s+":"125"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFE6E3D50","b":"55CAF9A88000","o":"4C5BD50","s":"_ZN5mongo11ClockSource21waitForConditionUntilERNS_4stdx18condition_variableENS_20BasicLockableAdapterENS_6Date_tEPNS_8WaitableE","C":"mongo::ClockSource::waitForConditionUntil(mongo::stdx::condition_variable&, mongo::BasicLockableAdapter, mongo::Date_t, mongo::Waitable*)","s+":"5F0"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFC241AB2","b":"55CAF9A88000","o":"27B9AB2","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_18PeriodicRunnerImpl15PeriodicJobImpl4_runEvEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","C":"std::thread::_State_impl<std::thread::_Invoker<std::tuple<mongo::stdx::thread::thread<mongo::PeriodicRunnerImpl::PeriodicJobImpl::_run()::{lambda()#2}, , 0>(mongo::PeriodicRunnerImpl::PeriodicJobImpl::_run()::{lambda()#2})::{lambda()#1}> > >::_M_run()","s+":"782"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"55CAFEA09E8F","b":"55CAF9A88000","o":"4F81E8F","s":"execute_native_thread_routine","s+":"F"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F70589B43","b":"7F2F704F5000","o":"94B43","s":"pthread_condattr_setpshared","s+":"513"}}}
|
{"t":{"$date":"2023-03-09T15:08:59.355-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"TimestampMonitor","msg":"Frame","attr":{"frame":{"a":"7F2F7061BA00","b":"7F2F704F5000","o":"126A00","s":"__xmknodat","s+":"230"}}}
|
|
|
Thanks Chris, I've uploaded 4 .tar.gz files.
mongos-crash.tar.gz - This is for the mongos router
mongod-replicaset_member-crash.tar.gz
mongod-replicaset_member-shard1-crash.tar.gz
mongod-replicaset_member-shard2-crash.tar.gz
mongod-replicaset_member-crash.tar.gz (this was actually for shard2 as well but a different server)
|
|
Hi amit.gupta@opensense.com,
I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.
For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link (a minimal archiving example follows the list):
- the mongod logs
- the $dbpath/diagnostic.data directory (the contents are described here)
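A minimal archiving sketch for one node (paths are assumptions; substitute the actual mongod log location and dbpath):
# bundle the mongod log and the FTDC diagnostic.data directory for upload
tar czf node1-mongod-diagnostics.tar.gz \
    /var/log/mongodb/mongod.log \
    /var/lib/mongo/diagnostic.data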
Christopher
|
|
Here is a stack trace from yet another crash that just happened on a different secondary in a replicaSet:
{"t":\{"$date":"2023-03-09T08:51:55.263-05:00"}
|
,"s":"F", "c":"CONTROL", "id":6384300, "ctx":"conn3397","msg":"Writing fatal message","attr":{"message":"Got signal: 11 (Segmentation fault).\n"}}
|
{"t":\{"$date":"2023-03-09T08:51:55.433-05:00"}
|
,"s":"I", "c":"CONTROL", "id":31380, "ctx":"conn3397","msg":"BACKTRACE","attr":{"bt":{"backtrace":[
|
{"a":"56099B1D68D4","b":"5609963F8000","o":"4DDE8D4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}
|
,{"a":"56099B1D8E19","b":"5609963F8000","o":"4DE0E19","s":"ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"},{"a":"56099B1D2C4C","b":"5609963F8000","o":"4DDAC4C","s":"abruptQuitWithAddrSignal","s+":"EC"},{"a":"7F9C219F9520","b":"7F9C219B7000","o":"42520","s":"sigaction","s+":"50"},{"a":"56099964AACE","b":"5609963F8000","o":"3252ACE","s":"_ZN4absl12lts_2021032418container_internal12raw_hash_setINS1_17NodeHashMapPolicyIxPN5mongo12ClientCursorEEENS0_13hash_internal4HashIxEESt8equal_toIxESaISt4pairIKxS6_EEE4findIxEENSH_8iteratorERSE_m","C":"absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::find<long long>(long long const&, unsigned long)","s+":"3E"},{"a":"560999647625","b":"5609963F8000","o":"324F625","s":"_ZN5mongo13CursorManager9pinCursorEPNS_16OperationContextExRKSt8functionIFvRKNS_12ClientCursorEEENS0_9AuthCheckE","C":"mongo::CursorManager::pinCursor(mongo::OperationContext*, long long, std::function<void (mongo::ClientCursor const&)> const&, mongo::CursorManager::AuthCheck)","s+":"F5"},{"a":"56099911AD5D","b":"5609963F8000","o":"2D22D5D","s":"_ZN5mongo12_GLOBAL_N_110GetMoreCmd10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::(anonymous namespace)::GetMoreCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)","s+":"14D"},{"a":"560999F625A1","b":"5609963F8000","o":"3B6A5A1","s":"ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::CommandHelpers::runCommandInvocation(mongo::OperationContext*, mongo::OpMsgRequest const&, mongo::CommandInvocation*, mongo::rpc::ReplyBuilderInterface*)","s+":"61"},{"a":"560999F66619","b":"5609963F8000","o":"3B6E619","s":"_ZN5mongo14CommandHelpers20runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEENS_9transport15ServiceExecutor14ThreadingModelE","C":"mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, mongo::transport::ServiceExecutor::ThreadingModel)","s+":"1C9"},{"a":"56099855FB27","b":"5609963F8000","o":"2167B27","s":"_ZN5mongo12_GLOBAL_N_120runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEE","C":"mongo::(anonymous namespace)::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>)","s+":"97"},{"a":"56099856FC26","b":"5609963F8000","o":"2177C26","s":"ZN5mongo12_GLOBAL_N_114RunCommandImpl11_runCommandEv","C":"mongo::(anonymous namespace)::RunCommandImpl::runCommand()","s+":"146"},{"a":"5609985704B6","b":"5609963F8000","o":"21784B6","s":"_ZN5mongo12_GLOBAL_N_132RunCommandAndWaitForWriteConcern24_runCommandWithFailPointEv","C":"mongo::(anonymous 
namespace)::RunCommandAndWaitForWriteConcern::runCommandWithFailPoint()","s+":"86"},{"a":"56099857097C","b":"5609963F8000","o":"217897C","s":"_ZN5mongo12_GLOBAL_N_132RunCommandAndWaitForWriteConcern8_runImplEv","C":"mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern::runImpl()","s+":"3C"},{"a":"560998560BDC","b":"5609963F8000","o":"2168BDC","s":"_ZN5mongo12_GLOBAL_N_114RunCommandImpl3runEv","C":"mongo::(anonymous namespace)::RunCommandImpl::run()","s+":"13C"},{"a":"560998567BB4","b":"5609963F8000","o":"216FBB4","s":"ZN5mongo12_GLOBAL_N_119ExecCommandDatabase12_commandExecEv","C":"mongo::(anonymous namespace)::ExecCommandDatabase::commandExec()","s+":"1E4"},{"a":"56099856C85F","b":"5609963F8000","o":"217485F","s":"_ZZN5mongo12_GLOBAL_N_114executeCommandESt10shared_ptrINS0_13HandleRequest16ExecutionContextEEENUlvE0_clEv","C":"mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::
|
{lambda()#2}::operator()()","s+":"43F"},{"a":"56099856E450","b":"5609963F8000","o":"2176450","s":"ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL_N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE0_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_","C":"mongo::unique_function<void (mongo::future_details::SharedStateBase*)>::makeImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}
|
>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1})::SpecificImpl::call(mongo::future_details::SharedStateBase*&&)","s+":"80"},{"a":"56099851EEB7","b":"5609963F8000","o":"2126EB7","s":"ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","C":"mongo::future_details::SharedStateBase::transitionToFinished()","s+":"147"},{"a":"5609985723B2","b":"5609963F8000","o":"217A3B2","s":"_ZN5mongo14future_details10FutureImplINS0_8FakeVoidEE11generalImplIZNOS3_17propagateResultToEPNS0_15SharedStateImplIS2_EEEUlOS2_E_ZNOS3_17propagateResultToES7_EUlONS_6StatusEE0_ZNOS3_17propagateResultToES7_EUlvE1_EEDaOT_OT0_OT1","C":"auto mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::generalImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*) &&::
|
{lambda(mongo::future_details::FakeVoid&&)#1}
|
, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda(mongo::Status&&)#2}, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda()#3}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda(mongo::future_details::FakeVoid&&)#1}&&, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda(mongo::Status&&)#2}&&, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda()#3}&&)","s+":"152"},{"a":"560998562D1C","b":"5609963F8000","o":"216AD1C","s":"ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL_N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_","C":"mongo::unique_function<void (mongo::future_details::SharedStateBase)>::makeImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::
|
{lambda()#1}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#1}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1})::SpecificImpl::call(mongo::future_details::SharedStateBase*&&)","s+":"AC"},{"a":"56099851EEB7","b":"5609963F8000","o":"2126EB7","s":"ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","C":"mongo::future_details::SharedStateBase::transitionToFinished()","s+":"147"},{"a":"56099856D1E3","b":"5609963F8000","o":"21751E3","s":"_ZN5mongo12_GLOBAL_N_114executeCommandESt10shared_ptrINS0_13HandleRequest16ExecutionContextEE","C":"mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)","s+":"1D3"},{"a":"56099856DA55","b":"5609963F8000","o":"2175A55","s":"ZN5mongo12_GLOBAL_N_116receivedCommandsESt10shared_ptrINS0_13HandleRequest16ExecutionContextEE","C":"mongo::(anonymous namespace)::receivedCommands(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)","s+":"2D5"},{"a":"56099856E396","b":"5609963F8000","o":"2176396","s":"ZN5mongo12_GLOBAL_N_115CommandOpRunner3runEv","C":"mongo::(anonymous namespace)::CommandOpRunner::run()","s+":"36"},{"a":"56099856536F","b":"5609963F8000","o":"216D36F","s":"ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageESt10unique_ptrIKNS0_5HooksESt14default_deleteIS8_EE","C":"mongo::ServiceEntryPointCommon::handleRequest(mongo::OperationContext*, mongo::Message const&, std::unique_ptr<mongo::ServiceEntryPointCommon::Hooks const, std::default_delete<mongo::ServiceEntryPointCommon::Hooks const> 
>)","s+":"53F"},{"a":"560998559B2F","b":"5609963F8000","o":"2161B2F","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE","C":"mongo::ServiceEntryPointMongod::handleRequest(mongo::OperationContext*, mongo::Message const&)","s+":"4F"},{"a":"5609993BFEB7","b":"5609963F8000","o":"2FC7EB7","s":"_ZN5mongo9transport19ServiceStateMachine4Impl14processMessageEv","C":"mongo::transport::ServiceStateMachine::Impl::processMessage()","s+":"127"},{"a":"5609993C1BC6","b":"5609963F8000","o":"2FC9BC6","s":"_ZN5mongo9transport19ServiceStateMachine4Impl12startNewLoopERKNS_6StatusE","C":"mongo::transport::ServiceStateMachine::Impl::startNewLoop(mongo::Status const&)","s+":"76"},{"a":"5609993C205F","b":"5609963F8000","o":"2FCA05F","s":"_ZZN5mongo15unique_functionIFvNS_6StatusEEE8makeImplIZNS_9transport19ServiceStateMachine4Impl15scheduleNewLoopES1_EUlS1_E_EEDaOT_EN12SpecificImpl4callEOS1","C":"mongo::unique_function<void (mongo::Status)>::makeImpl<mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}>(mongo::transport::ServiceStateMachine::Impl::scheduleNewLoop(mongo::Status)::{lambda(mongo::Status)#1}&&)::SpecificImpl::call(mongo::Status&&)","s+":"7F"},{"a":"560998AACEE3","b":"5609963F8000","o":"26B4EE3","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport15ServiceExecutor8scheduleENS0_IFvNS_6StatusEEEEEUlvE_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}
|
>(mongo::transport::ServiceExecutor::schedule(mongo::unique_function<void (mongo::Status)>)::{lambda()#1}&&)::SpecificImpl::call()","s+":"33"},{"a":"56099A9D864B","b":"5609963F8000","o":"45E064B","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_9transport26ServiceExecutorSynchronous12scheduleTaskES2_NS4_15ServiceExecutor13ScheduleFlagsEEUlvE0_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::
|
{lambda()#2}
|
>(mongo::transport::ServiceExecutorSynchronous::scheduleTask(mongo::unique_function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags)::{lambda()#2}&&)::SpecificImpl::call()","s+":"BB"},{"a":"56099A9DA490","b":"5609963F8000","o":"45E2490","s":"_ZZN5mongo15unique_functionIFvvEE8makeImplIZNS_25launchServiceWorkerThreadES2_EUlvE2_EEDaOT_EN12SpecificImpl4callEv","C":"mongo::unique_function<void ()>::makeImpl<mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::
|
{lambda()#4}
|
>(mongo::launchServiceWorkerThread(mongo::unique_function<void ()>)::{lambda()#4}&&)::SpecificImpl::call()","s+":"80"},{"a":"56099A9DA53C","b":"5609963F8000","o":"45E253C","s":"ZN5mongo12_GLOBAL_N_17runFuncEPv","C":"mongo::(anonymous namespace)::runFunc(void*)","s+":"1C"},{"a":"7F9C21A4BB43","b":"7F9C219B7000","o":"94B43","s":"pthread_condattr_setpshared","s+":"513"},{"a":"7F9C21ADDA00","b":"7F9C219B7000","o":"126A00","s":"__xmknodat","s+":"230"}],"processInfo":{"mongodbVersion":"6.0.4","gitVersion":"44ff59461c1353638a71e710f385a566bcd2f547","compiledModules":[],"uname":
|
{"sysname":"Linux","release":"5.15.0-67-generic","version":"#74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023","machine":"x86_64"}
|
,"somap":[\{"b":"5609963F8000","elfType":3,"buildId":"BCB8F6C3AD4ABF41C81882BAE25C6885C6266CC4"},\{"b":"7F9C219B7000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"69389D485A9793DBE873F0EA2C93E02EFAA9AA3D"}]}}},"tags":[]}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"56099B1D68D4","b":"5609963F8000","o":"4DDE8D4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}}}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"56099B1D8E19","b":"5609963F8000","o":"4DE0E19","s":"_ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"}}}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"56099B1D2C4C","b":"5609963F8000","o":"4DDAC4C","s":"abruptQuitWithAddrSignal","s+":"EC"}}}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"7F9C219F9520","b":"7F9C219B7000","o":"42520","s":"__sigaction","s+":"50"}}}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"56099964AACE","b":"5609963F8000","o":"3252ACE","s":"_ZN4absl12lts_2021032418container_internal12raw_hash_setINS1_17NodeHashMapPolicyIxPN5mongo12ClientCursorEEENS0_13hash_internal4HashIxEESt8equal_toIxESaISt4pairIKxS6_EEE4findIxEENSH_8iteratorERSE_m","C":"absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::find<long long>(long long const&, unsigned long)","s+":"3E"}}}
{"t":{"$date":"2023-03-09T08:51:55.433-05:00"},"s":"I", "c":"CONTROL", "id":31445, "ctx":"conn3397","msg":"Frame","attr":{"frame":{"a":"560999647625","b":"5609963F8000","o":"324F625","s":"_ZN5mongo13CursorManager9pinCursorEPNS_16OperationContextExRKSt8functionIFvRKNS_12ClientCursorEEENS0_9AuthCheckE","C":"mongo::CursorManager::pinCursor(mongo::OperationContext*, long long, std::function<void (mongo::ClientCursor const&)> const&, mongo::CursorManager::AuthCheck)","s+":"F5"}}}
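A note on reading these frames: each entry's "C" field is the demangled symbol where one is available, and "s" is the raw mangled name, so the "Frame" entries above can also be symbolized off-box. A minimal sketch, assuming jq and c++filt (from binutils) are installed; the log path below is an assumption, substitute this node's actual systemLog path:

# Pull the mangled symbol out of every "Frame" entry and demangle it.
# /var/log/mongodb/mongod.log is an assumed path; adjust to your config.
grep '"msg":"Frame"' /var/log/mongodb/mongod.log | jq -r '.attr.frame.s' | c++filt

Each frame's "o" value is just the return address minus the module base ("a" minus "b"), so with a matching debug-symbol build of the same mongod binary the offsets could also be fed to addr2line.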
|
We had another replica set member die in a different shard with a similar error. We've saved the logs and diagnostic data.
|
We just got an alert that one of the shard members died in a 5-node replica set. Below are the last lines of the mongod.log, saved for upload as well (I had to cut off part of the error since it was too long for this comment):
|
{"t":\{"$date":"2023-03-09T08:51:55.263-05:00"}
|
,"s":"F", "c":"CONTROL", "id":6384300, "ctx":"conn3397","msg":"Writing fatal message","attr":{"message":"Invalid access at address: 0\n"}}
|
{"t":\{"$date":"2023-03-09T08:51:55.263-05:00"}
|
,"s":"F", "c":"CONTROL", "id":6384300, "ctx":"conn3397","msg":"Writing fatal message","attr":{"message":"Got signal: 11 (Segmentation fault).\n"}}
|
{"t":\{"$date":"2023-03-09T08:51:55.433-05:00"}
|
,"s":"I", "c":"CONTROL", "id":31380, "ctx":"conn3397","msg":"BACKTRACE","attr":{"bt":{"backtrace":[
|
{"a":"56099B1D68D4","b":"5609963F8000","o":"4DDE8D4","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.362","C":"mongo::stack_trace_detail::(anonymous namespace)::printStackTraceImpl(mongo::stack_trace_detail::(anonymous namespace)::Options const&, mongo::StackTraceSink*) [clone .constprop.362]","s+":"1F4"}
|
,{"a":"56099B1D8E19","b":"5609963F8000","o":"4DE0E19","s":"ZN5mongo15printStackTraceEv","C":"mongo::printStackTrace()","s+":"29"},{"a":"56099B1D2C4C","b":"5609963F8000","o":"4DDAC4C","s":"abruptQuitWithAddrSignal","s+":"EC"},{"a":"7F9C219F9520","b":"7F9C219B7000","o":"42520","s":"sigaction","s+":"50"},{"a":"56099964AACE","b":"5609963F8000","o":"3252ACE","s":"_ZN4absl12lts_2021032418container_internal12raw_hash_setINS1_17NodeHashMapPolicyIxPN5mongo12ClientCursorEEENS0_13hash_internal4HashIxEESt8equal_toIxESaISt4pairIKxS6_EEE4findIxEENSH_8iteratorERSE_m","C":"absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::iterator absl::lts_20210324::container_internal::raw_hash_set<absl::lts_20210324::container_internal::NodeHashMapPolicy<long long, mongo::ClientCursor*>, absl::lts_20210324::hash_internal::Hash<long long>, std::equal_to<long long>, std::allocator<std::pair<long long const, mongo::ClientCursor*> > >::find<long long>(long long const&, unsigned long)","s+":"3E"},{"a":"560999647625","b":"5609963F8000","o":"324F625","s":"_ZN5mongo13CursorManager9pinCursorEPNS_16OperationContextExRKSt8functionIFvRKNS_12ClientCursorEEENS0_9AuthCheckE","C":"mongo::CursorManager::pinCursor(mongo::OperationContext*, long long, std::function<void (mongo::ClientCursor const&)> const&, mongo::CursorManager::AuthCheck)","s+":"F5"},{"a":"56099911AD5D","b":"5609963F8000","o":"2D22D5D","s":"_ZN5mongo12_GLOBAL_N_110GetMoreCmd10Invocation3runEPNS_16OperationContextEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::(anonymous namespace)::GetMoreCmd::Invocation::run(mongo::OperationContext*, mongo::rpc::ReplyBuilderInterface*)","s+":"14D"},{"a":"560999F625A1","b":"5609963F8000","o":"3B6A5A1","s":"ZN5mongo14CommandHelpers20runCommandInvocationEPNS_16OperationContextERKNS_12OpMsgRequestEPNS_17CommandInvocationEPNS_3rpc21ReplyBuilderInterfaceE","C":"mongo::CommandHelpers::runCommandInvocation(mongo::OperationContext*, mongo::OpMsgRequest const&, mongo::CommandInvocation*, mongo::rpc::ReplyBuilderInterface*)","s+":"61"},{"a":"560999F66619","b":"5609963F8000","o":"3B6E619","s":"_ZN5mongo14CommandHelpers20runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEENS_9transport15ServiceExecutor14ThreadingModelE","C":"mongo::CommandHelpers::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>, mongo::transport::ServiceExecutor::ThreadingModel)","s+":"1C9"},{"a":"56099855FB27","b":"5609963F8000","o":"2167B27","s":"_ZN5mongo12_GLOBAL_N_120runCommandInvocationESt10shared_ptrINS_23RequestExecutionContextEES1_INS_17CommandInvocationEE","C":"mongo::(anonymous namespace)::runCommandInvocation(std::shared_ptr<mongo::RequestExecutionContext>, std::shared_ptr<mongo::CommandInvocation>)","s+":"97"},{"a":"56099856FC26","b":"5609963F8000","o":"2177C26","s":"ZN5mongo12_GLOBAL_N_114RunCommandImpl11_runCommandEv","C":"mongo::(anonymous namespace)::RunCommandImpl::runCommand()","s+":"146"},{"a":"5609985704B6","b":"5609963F8000","o":"21784B6","s":"_ZN5mongo12_GLOBAL_N_132RunCommandAndWaitForWriteConcern24_runCommandWithFailPointEv","C":"mongo::(anonymous 
namespace)::RunCommandAndWaitForWriteConcern::runCommandWithFailPoint()","s+":"86"},{"a":"56099857097C","b":"5609963F8000","o":"217897C","s":"_ZN5mongo12_GLOBAL_N_132RunCommandAndWaitForWriteConcern8_runImplEv","C":"mongo::(anonymous namespace)::RunCommandAndWaitForWriteConcern::runImpl()","s+":"3C"},{"a":"560998560BDC","b":"5609963F8000","o":"2168BDC","s":"_ZN5mongo12_GLOBAL_N_114RunCommandImpl3runEv","C":"mongo::(anonymous namespace)::RunCommandImpl::run()","s+":"13C"},{"a":"560998567BB4","b":"5609963F8000","o":"216FBB4","s":"ZN5mongo12_GLOBAL_N_119ExecCommandDatabase12_commandExecEv","C":"mongo::(anonymous namespace)::ExecCommandDatabase::commandExec()","s+":"1E4"},{"a":"56099856C85F","b":"5609963F8000","o":"217485F","s":"_ZZN5mongo12_GLOBAL_N_114executeCommandESt10shared_ptrINS0_13HandleRequest16ExecutionContextEEENUlvE0_clEv","C":"mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::
|
{lambda()#2}::operator()()","s+":"43F"},{"a":"56099856E450","b":"5609963F8000","o":"2176450","s":"ZZN5mongo15unique_functionIFvPNS_14future_details15SharedStateBaseEEE8makeImplIZNS1_10FutureImplINS1_8FakeVoidEE16makeContinuationIvZZNOS9_4thenIZNS_12_GLOBAL_N_114executeCommandESt10shared_ptrINSC_13HandleRequest16ExecutionContextEEEUlvE0_EEDaOT_ENKUlvE1_clEvEUlPNS1_15SharedStateImplIS8_EESN_E_EENS7_ISI_EEOT0_EUlS3_E_EEDaSJ_EN12SpecificImpl4callEOS3_","C":"mongo::unique_function<void (mongo::future_details::SharedStateBase*)>::makeImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}
|
>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::makeContinuation<void, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::then<mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}>(mongo::(anonymous namespace)::executeCommand(std::shared_ptr<mongo::(anonymous namespace)::HandleRequest::ExecutionContext>)::{lambda()#2}&&) &&::{lambda()#3}::operator()() const::{lambda(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>, mongo::future_details::SharedStateImpl)#1}&&)::{lambda(mongo::future_details::SharedStateBase*)#1})::SpecificImpl::call(mongo::future_details::SharedStateBase*&&)","s+":"80"},{"a":"56099851EEB7","b":"5609963F8000","o":"2126EB7","s":"ZN5mongo14future_details15SharedStateBase20transitionToFinishedEv","C":"mongo::future_details::SharedStateBase::transitionToFinished()","s+":"147"},{"a":"5609985723B2","b":"5609963F8000","o":"217A3B2","s":"_ZN5mongo14future_details10FutureImplINS0_8FakeVoidEE11generalImplIZNOS3_17propagateResultToEPNS0_15SharedStateImplIS2_EEEUlOS2_E_ZNOS3_17propagateResultToES7_EUlONS_6StatusEE0_ZNOS3_17propagateResultToES7_EUlvE1_EEDaOT_OT0_OT1","C":"auto mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::generalImpl<mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>*) &&::
|
{lambda(mongo::future_details::FakeVoid&&)#1}
|
, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda(mongo::Status&&)#2}, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda()#3}>(mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::{lambda(mongo::future_details::FakeVoid&&)#1}&&, mongo::future_details::FutureImpl<mongo::future_details::FakeVoid>::propagateResultTo(mongo::future_details::SharedStateImpl<mongo::future_details::FakeVoid>) &&::
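Since the full backtrace was too long to paste, one way to capture the complete entry for upload is to pull it out of the log by its id and compress it. A rough sketch, with an assumed log path:

# "id":31380 is the BACKTRACE entry shown above; adjust the path to this node's log.
grep '"id":31380' /var/log/mongodb/mongod.log | gzip > backtrace.json.gz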
|
Looking through our systemd logs, it appears the mongos router actually exited once on its own about 2 minutes and 30 seconds before the segfault:
Mar 9 08:19:30 appserver systemd[1]: mongos.service: Main process exited, code=exited, status=14/n/a
Mar 9 08:22:00 appserver systemd[1]: mongos.service: Main process exited, code=killed, status=11/SEGV
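If useful, the same two exits can be confirmed straight from the journal; a minimal sketch, with the unit name and time window taken from the syslog lines above:

journalctl -u mongos.service --since "2023-03-09 08:15:00" --until "2023-03-09 08:25:00" | grep 'Main process exited'

For reference, status=14 here is the process's own exit code (MongoDB uses 14 for an unexpected, unrecoverable error), while status=11/SEGV means the process was killed by a segmentation fault rather than exiting on its own.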