[SERVER-53482] Invariant failure: opCtx != nullptr && _opCtx == nullptr in src/mongo/db/client.cpp, line: 126 Created: 22/Dec/20  Updated: 27/Jan/21  Resolved: 27/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.1, 4.4.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Brian Granetzke Assignee: Dmitry Agranat
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive 2020-11-09T00_16_19.496.zip, Zip Archive 2020-12-19T21_21_13.189.zip
Issue Links:
Duplicate
is duplicated by SERVER-52735 mongodb crash with "Invariant failure... Closed
Related
related to SERVER-52735 mongodb crash with "Invariant failure... Closed
related to SERVER-53566 Investigate and reproduce "opCtx != n... Closed
Participants:

 Description   

I've seen the above on three different occasions since upgrading to 4.4.x.

The first time was on 4.4.1.  I was adding a hidden node to the replica set.  The primary locked up for a few seconds and then all three nodes crashed with the same fatal error.

The second time was on 4.4.x and I don't recall anything special going on.  It only crashed one instance that time.

The third time was on 4.4.2-ent (Cloud Manager Backup limited license) and it happened unprovoked.  It started with the primary locking up for about 10 seconds (no log activity, failed health check), and then the remaining servers crashed, each as soon as it was elected primary.

In all three cases, the ctx in the fatal log entry was TopologyVersionObserver, and the most recent case also had entries with ctx=monitoring-keys-for-HMAC.

I could not find any similar issues regarding this particular invariant failure (client.cpp).

Happy to provide additional context/logs if you feel this is worth investigating; otherwise I will likely have to revert to 4.2.  The above invariant failure has only happened on 3 occasions, but the mongod freeze that preceded each of them has happened over a dozen times in the last 2 months.



 Comments   
Comment by Dmitry Agranat [ 27/Jan/21 ]

Hi b.granetzke@fetchrewards.com, as this issue is now being investigated in SERVER-52735, I will go ahead and close this case. Please watch SERVER-52735 for updates.

Comment by Brian Granetzke [ 21/Jan/21 ]

I have not seen the issue again since downgrading the FCV to 4.2 (still same binaries).

Comment by Dmitry Agranat [ 20/Jan/21 ]

Hi b.granetzke@fetchrewards.com, did you have a chance to collect a core dump?

As an update, this issue is now being investigated in SERVER-52735.

Thanks,
Dima

Comment by Jonathan Streets (Inactive) [ 07/Jan/21 ]

Hi b.granetzke@fetchrewards.com, I was able to test these instructions to produce a core dump on Amazon Linux 2. Please let us know how it goes.  Also, please note that a core dump contains PII and will consume approximately as much disk space as the virtual memory in use by the process when it crashes. jon

Edit /etc/systemd/system.conf and add

  • DumpCore=yes
  • DefaultLimitCORE=infinity
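
For reference, a minimal sketch of what the relevant part of /etc/systemd/system.conf looks like after the edit (the two settings go under the existing [Manager] header; everything else can stay commented out):

[Manager]
DumpCore=yes
DefaultLimitCORE=infinity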

Re-execute the systemd manager so it picks up the new settings:
systemctl daemon-reexec

Change the default core pattern so that cores are piped to the systemd coredump utility (the echo command below is all on one line):

echo '|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e' > /proc/sys/kernel/core_pattern

Make the new core pattern persist across reboots by adding the following line to /etc/sysctl.conf:
kernel.core_pattern=|/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e

 Reload the new configuration with 
sysctl -p

Check the new configuration with
sysctl -a | grep kernel | grep core
The output should include:
kernel.core_pattern = |/usr/lib/systemd/systemd-coredump %p %u %g %s %t %e

 Set the core file size from 0 to unlimited:
ulimit -c unlimited

Check with
ulimit -a

Restart MongoDB (or reboot)
systemctl restart mongod

Test by sending SIGSEGV to mongod
ps -ef | grep mongod
kill -11 NNNNN (where NNNNN is the PID of mongod)

The coredump should show up in 
systemd-coredumpctl list

And in
ls /var/lib/systemd/coredump
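
Depending on the systemd version, the listing tool may be installed as coredumpctl rather than systemd-coredumpctl. As a rough sketch (the PID is illustrative), the captured core can also be exported to a regular file for upload:

coredumpctl list mongod
coredumpctl dump NNNNN -o mongod.core (where NNNNN is the PID shown in the list)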

Comment by Brian Granetzke [ 06/Jan/21 ]

Also, just an FYI:  I'm still running 4.4.2-ent binaries, but I recently set FCV back to 4.2 in prep for a downgrade.  I've convinced management to hold off on the downgrade until one more primary failure (e.g. the pause).  I will grab the diagnostic info when that occurs and whatever else you need.  I assume the FCV isn't related to the underlying cause, but I thought I'd mention it in case it's relevant.

Comment by Brian Granetzke [ 06/Jan/21 ]

Latest release of Amazon Linux 2... appreciate the help.

Comment by Jonathan Streets (Inactive) [ 06/Jan/21 ]

hi b.granetzke@fetchrewards.com,

Core dumps aren't on by default, and the commands to turn them on and collect them depend very much on the operating system.  I see that it's Amazon Linux 2; I'll look for some commands.

 jon 

Comment by Brian Granetzke [ 06/Jan/21 ]

@Dmitry I tried finding them after the prior crashes but could not find any.  Do invariant failures produce them?  If so, where should I look?

Comment by Dmitry Agranat [ 06/Jan/21 ]

b.granetzke@fetchrewards.com, do you happen to have a core dump from one of the previous events? If not, would it be possible to save one if this failure reoccurs?

Comment by Dmitry Agranat [ 04/Jan/21 ]

Thanks b.granetzke@fetchrewards.com for uploading the data, we are looking into this and will update you on our findings. Just to confirm, do you still have diagnostic.data for either of the reported events? If not, as Bruce mentioned above, if this happens again, it would be helpful to collect logs and diagnostic.data; that might have more information about the freeze.

Comment by Brian Granetzke [ 28/Dec/20 ]

@Bruce: I can send you logs from all three servers for additional freeze events if you would like data for more than the attached two.  The pattern is very similar for all of them.  Basically, one of the secondaries gets a socket timeout for the primary and initiates an election.  My specific search string to quantify the situation is: msg="Starting an election, since we've seen no PRIMARY in election timeout period".  I can take the timestamp of the prior NetworkTimeout on the secondary, look on the primary, and see a distinct gap in log events at that time (give or take milliseconds).
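
For reference, the searches look roughly like this (file names are illustrative; since the 4.4 logs are JSON lines, a plain grep is enough):

grep "Starting an election, since we've seen no PRIMARY in election timeout period" mongod-secondary.log
grep NetworkTimeout mongod-secondary.log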

Comment by Brian Granetzke [ 28/Dec/20 ]

I've uploaded the logs for the two invariant failure events.  The file names indicate the exact log time of the oldest invariant failure for the event.  In both cases, all three nodes ended up crashing with the same invariant failure.  The first event was on v4.4.1 and the second was on 4.4.2-ent.

Regarding the "pause", both events started with the primary pausing long enough for it to miss the heartbeat.  For the 12/19 event, the pause was on the mongo2 server between 2020-12-19T21:21:02.971 and 2020-12-19T21:21:11.893.  You'll notice based on log rate prior to the pause, a 9 second period with no logs is extremely unlikely.

The pause events are much more frequent than the invariant failures.  We've had 18 random primary failures since the upgrade to 4.4 (8-10 weeks ago).  The most recent was 12/23, but the diagnostic data has already rolled since that event as well.

Comment by Bruce Lucas (Inactive) [ 23/Dec/20 ]

Also the freeze that you mention is interesting. If you are willing it would be helpful if you could hold off on reverting to 4.2 until it happens one more time, then collect logs and diagnostic.data; that might have more information about the freeze.

Comment by Bruce Lucas (Inactive) [ 23/Dec/20 ]

b.granetzke@fetchrewards.com this comment you posted on SERVER-52735 seems especially interesting:

Also, twice, it crashed all three nodes on the cluster within seconds.

I think as you suggested logs in a 2 hour window would be useful, especially if you have logs for all 3 nodes that crashed simultaneously.

Comment by Brian Granetzke [ 23/Dec/20 ]

Unfortunately the metrics files have rolled since the most recent event (on the 19th).  The log files have rolled as well, but I can pull them from CloudWatch Logs.  It is a lot of data.  Should I just send you 1-hour windows before/after each of the 3 events?

Comment by Dmitry Agranat [ 23/Dec/20 ]

Hi b.granetzke@fetchrewards.com, thank you for the report.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima

Comment by Brian Granetzke [ 22/Dec/20 ]

Actually, my issue looks a lot like SERVER-52735.
