[SERVER-53482] Invariant failure: opCtx != nullptr && _opCtx == nullptr in src/mongo/db/client.cpp, line: 126 Created: 22/Dec/20 Updated: 27/Jan/21 Resolved: 27/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.1, 4.4.2 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Brian Granetzke | Assignee: | Dmitry Agranat |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Participants: | |
| Description |
|
I've seen the above invariant failure on three different occasions since upgrading to 4.4.x. The first time was on 4.4.1: I was adding a hidden node to the replica set, the primary locked up for a few seconds, and then all three nodes crashed with the same fatal error. The second time was on 4.4.x and I don't recall anything special going on; it only crashed one instance that time. The third time was on 4.4.2-ent (Cloud Manager Backup limited license) and it happened unprovoked: it started with the primary locking up for about 10 seconds (no log activity, failed health check), and then the remaining servers crashed as soon as they were elected primary.

In all three cases, the ctx in the log was TopologyVersionObserver, and the most recent case also had logs with ctx=monitoring-keys-for-HMAC. I could not find any similar issues regarding this particular invariant failure (client.cpp). Happy to provide additional context/logs if you feel this is worth investigating; otherwise I will likely have to revert to 4.2. The invariant failure itself has only happened on 3 occasions, but the mongod freeze that each one started with has happened over a dozen times in the last 2 months. |
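For anyone searching their own logs for the same signature, a minimal sketch, assuming the 4.4 structured (JSON) log format; the log path is a placeholder:

```sh
# Sketch: find the fatal invariant and the contexts mentioned above in a
# MongoDB 4.4 structured (JSON) mongod log. The path below is a placeholder.
LOG=/var/log/mongodb/mongod.log

# Fatal-severity entries mentioning the failed invariant expression
grep '"s":"F"' "$LOG" | grep 'opCtx != nullptr'

# Activity from the threads named in the report
grep -E '"ctx":"(TopologyVersionObserver|monitoring-keys-for-HMAC)"' "$LOG"
```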
| Comments |
| Comment by Dmitry Agranat [ 27/Jan/21 ] |
|
Hi b.granetzke@fetchrewards.com, as this issue is now being investigated in the linked ticket, I am resolving this report as a duplicate. |
| Comment by Brian Granetzke [ 21/Jan/21 ] |
|
I have not seen the issue again since downgrading the FCV to 4.2 (still same binaries). |
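For reference, a quick way to confirm the running FCV, sketched with the legacy mongo shell; the connection string is a placeholder:

```sh
# Check the current featureCompatibilityVersion (connection string is a placeholder).
mongo "mongodb://localhost:27017" --quiet --eval \
  'db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })'

# Downgrading FCV to 4.2, as described above, is run against the primary:
# db.adminCommand({ setFeatureCompatibilityVersion: "4.2" })
```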
| Comment by Dmitry Agranat [ 20/Jan/21 ] |
|
Hi b.granetzke@fetchrewards.com, did you have a chance to collect a core dump? As an update, this issue is now being investigated in the linked ticket. Thanks, |
| Comment by Jonathan Streets (Inactive) [ 07/Jan/21 ] |
|
Hi b.granetzke@fetchrewards.com, I was able to test these instructions to produce a coredump on Amazon Linux 2. Please let us know how it goes. Also, please note that a core dump contains PII and will consume approximately as much disk space as the virtual memory in use by the process when it crashes. Jon

- Edit /etc/systemd/system.conf and add the core dump size limit setting.
- Restart the systemd daemon.
- Change the default coredump name to pipe it to the coredump utility (all on one line): echo '|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e' > /proc/sys/kernel/core_pattern
- Make the new core name persist over reboots by changing /etc/sysctl.conf.
- Reload the new configuration.
- Check the new configuration with: sysctl -a | grep kernel | grep core
- Set the core file size from 0 to unlimited, and check the new limit.
- Restart MongoDB (or reboot).
- Test by sending SIGSEGV to mongod.
- The coredump should show up in the systemd coredump location (see the command sketch below). |
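A minimal sketch of these steps, assuming the usual systemd-coredump configuration; the specific values below (DefaultLimitCORE, the kernel.core_pattern sysctl entry, ulimit -c, coredumpctl) are assumptions rather than the exact commands tested above, so adjust for your environment:

```sh
# Sketch only: typical systemd-coredump setup on Amazon Linux 2.
# Values are assumed defaults; verify them for your environment.

# 1. Allow unlimited core sizes for systemd-managed services
echo 'DefaultLimitCORE=infinity' | sudo tee -a /etc/systemd/system.conf

# 2. Restart the systemd manager so the new default takes effect
sudo systemctl daemon-reexec

# 3. Pipe core dumps to systemd-coredump (all on one line)
echo '|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e' | sudo tee /proc/sys/kernel/core_pattern

# 4. Persist the pattern across reboots and reload sysctl
echo 'kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p

# 5. Confirm the kernel settings
sysctl -a | grep kernel | grep core

# 6. Raise and check the core size limit for the current shell
ulimit -c unlimited
ulimit -c

# 7. Restart mongod, then test by sending SIGSEGV to it
sudo systemctl restart mongod
sudo kill -SEGV "$(pidof mongod)"

# 8. The dump should appear under /var/lib/systemd/coredump and in:
coredumpctl list mongod
```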
| Comment by Brian Granetzke [ 06/Jan/21 ] |
|
Also, just an FYI: I'm still running 4.4.2-ent binaries, but I recently set FCV back to 4.2 in prep for a downgrade. I've convinced management to hold off on the downgrade until one more primary failure (e.g. the pause). I will grab the diagnostic info when that occurs and whatever else you need. I assume the FCV isn't related to the underlying cause, but I thought I'd mention it in case it's relevant. |
| Comment by Brian Granetzke [ 06/Jan/21 ] |
|
Latest release of Amazon Linux 2... appreciate the help |
| Comment by Jonathan Streets (Inactive) [ 06/Jan/21 ] |
|
Hi b.granetzke@fetchrewards.com, core dumps aren't on by default, and the commands to turn them on and collect them depend very much on the operating system. I see that it's Amazon Linux 2; I'll look for some commands. Jon |
| Comment by Brian Granetzke [ 06/Jan/21 ] |
|
@Dmitry I tried finding them for the prior crashes but could not find any. Do invariant failures produce them? If so, where should I look? |
| Comment by Dmitry Agranat [ 06/Jan/21 ] |
|
b.granetzke@fetchrewards.com, do you happen to have a core dump from one of the previous events? If not, would it be possible to save one if this failure reoccurs? |
| Comment by Dmitry Agranat [ 04/Jan/21 ] |
|
Thanks b.granetzke@fetchrewards.com for uploading the data, we are looking into this and will update you on our findings. Just to confirm, do you still have diagnostic.data for either of the reported events? If not, as Bruce mentioned above, if this happens again, it would be helpful to collect logs and diagnostic.data; that might have more information about the freeze. |
| Comment by Brian Granetzke [ 28/Dec/20 ] |
|
@Bruce: I can send you logs from all three servers for additional freeze events if you would like data for more than the attached two. The pattern is very similar for all of them: one of the secondaries gets a socket timeout against the primary and initiates an election. My specific search string to quantify the situation is: msg="Starting an election, since we've seen no PRIMARY in election timeout period". I can take the timestamp of the prior NetworkTimeout on the secondary, look on the primary, and see a distinct gap in log events at that time (give or take milliseconds). |
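A minimal sketch of that search, assuming the 4.4 structured (JSON) log format; the log path is a placeholder:

```sh
# Sketch of the search described above. The log path is a placeholder.
LOG=/var/log/mongodb/mongod.log

# Elections started because no primary was seen within the election timeout (on a secondary)
grep "Starting an election, since we've seen no PRIMARY in election timeout period" "$LOG"

# The NetworkTimeout errors that precede them
grep '"NetworkTimeout"' "$LOG"

# Timestamps only, to spot the multi-second gap in the primary's log around that time
grep -o '"\$date":"[^"]*"' "$LOG" | tail -200
```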
| Comment by Brian Granetzke [ 28/Dec/20 ] |
|
I've uploaded the logs for the two invariant failure events. The file names indicate the exact log time of the oldest invariant failure for each event. In both cases, all three nodes ended up crashing with the same invariant failure. The first event was on v4.4.1 and the second was on 4.4.2-ent. Regarding the "pause", both events started with the primary pausing long enough for it to miss the heartbeat. For the 12/19 event, the pause was on the mongo2 server between 2020-12-19T21:21:02.971 and 2020-12-19T21:21:11.893. You'll notice, based on the log rate prior to the pause, that a 9-second period with no logs is extremely unlikely. The pause events are much more frequent than the invariant failures: we've had 18 random primary failures since the upgrade to 4.4 (8-10 weeks ago). The most recent was 12/23, but the diagnostic data has already rolled since that event as well. |
| Comment by Bruce Lucas (Inactive) [ 23/Dec/20 ] |
|
Also, the freeze that you mention is interesting. If you are willing, it would be helpful if you could hold off on reverting to 4.2 until it happens one more time, then collect logs and diagnostic.data; that might have more information about the freeze. |
| Comment by Bruce Lucas (Inactive) [ 23/Dec/20 ] |
|
b.granetzke@fetchrewards.com, regarding this comment you posted on the linked ticket:
I think, as you suggested, logs in a 2-hour window would be useful, especially if you have logs for all 3 nodes that crashed simultaneously. |
| Comment by Brian Granetzke [ 23/Dec/20 ] |
|
Unfortunately the metrics files have rolled since the most recent event (on the 19th). The log files have rolled as well, but I can pull them from CloudWatch Logs. It is a lot of data; should I just send you 1-hour windows before/after each of the 3 events? |
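A sketch of pulling such a window from CloudWatch Logs with the AWS CLI; the log group name and timestamps below are hypothetical placeholders based on the 12/19 event:

```sh
# Sketch: export roughly one hour either side of the 12/19 pause.
# The log group name is a hypothetical placeholder.
aws logs filter-log-events \
  --log-group-name /mongodb/mongo2 \
  --start-time "$(date -d '2020-12-19 20:21:00 UTC' +%s)000" \
  --end-time   "$(date -d '2020-12-19 22:21:00 UTC' +%s)000" \
  --query 'events[].message' --output text > mongo2_2020-12-19.log
# Large windows may require following the returned nextToken to page through results.
```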
| Comment by Dmitry Agranat [ 23/Dec/20 ] |
|
Hi b.granetzke@fetchrewards.com, thank you for the report. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, |
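A minimal sketch of the archiving step, assuming a hypothetical dbPath of /data/db and log path of /var/log/mongodb; substitute the values from your mongod configuration:

```sh
# Sketch: bundle the logs and diagnostic.data for upload.
# Both paths below are assumptions; use your configured values.
tar czf "$(hostname)-mongod-diagnostic.tar.gz" \
  /var/log/mongodb/mongod.log* \
  /data/db/diagnostic.data
```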
| Comment by Brian Granetzke [ 22/Dec/20 ] |
|
Actually, my issue looks a lot like the one described in the linked ticket. |