[SERVER-53747] replica set down after setFeatureCompatibilityVersion 4.4 Created: 13/Jan/21  Updated: 06/Dec/22  Resolved: 28/Jan/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Francesco Rivola Assignee: Backlog - Service Architecture
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File arbiter-44.png     PNG File image-2021-01-13-17-54-12-906.png     PNG File image-2021-01-14-13-07-05-310.png     PNG File image-2021-01-14-18-31-45-693.png     PNG File image-2021-01-14-18-32-15-793.png     PNG File image-2021-01-14-18-36-39-527.png     PNG File image-2021-01-20-10-00-43-525.png     PNG File image-2021-01-20-10-05-02-966.png     File mongod.log    
Issue Links:
Duplicate
duplicates SERVER-52735 mongodb crash with "Invariant failure... Closed
Related
related to SERVER-52735 mongodb crash with "Invariant failure... Closed
is related to SERVER-53566 Investigate and reproduce "opCtx != n... Closed
Assigned Teams:
Service Arch
Participants:
Case:

 Description   

Today we have experienced a major issue in our mongodb production database.

3 weeks ago we upgraded binaries from 4.2 to 4.4 in all nodes in the replica set.

Today, in order to complete the upgrade, we ran the following command from the primary, as described in the official docs (https://docs.mongodb.com/manual/release-notes/4.4-upgrade-replica-set/#feature-compatibility-version):

db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )
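
For reference, this is roughly the sequence we ran and would use to verify the result, from a mongo shell connected to the primary (a minimal sketch; the expected outputs in the comments are abbreviated):

// check the current FCV before the change
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
// before: { "featureCompatibilityVersion" : { "version" : "4.2" }, ... }

// raise the FCV (this is the command that triggered the crash for us)
db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )

// verify the change took effect
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )
// after: { "featureCompatibilityVersion" : { "version" : "4.4" }, ... }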

Right after executing the command, all data-bearing nodes of our replica set became unhealthy. CPU spiked to 100% on all data-bearing nodes.

Any attempt to establish a connection to the replica set ends with a timeout or a rejection.

The mongod log shows this fatal error:

{"t":{"$date":"2021-01-13T11:50:17.983+00:00"},"s":"F",  "c":"-",        "id":23079,   "ctx":"TopologyVersionObserver","msg":"Invariant failure","attr":{"expr":"opCtx != nullptr && _opCtx == nullptr","file":"src/mongo/db/client.cpp","line":126}}
{"t":{"$date":"2021-01-13T11:50:17.985+00:00"},"s":"F",  "c":"-",        "id":23080,   "ctx":"TopologyVersionObserver","msg":"\n\n***aborting after invariant() failure\n\n"}
{"t":{"$date":"2021-01-13T11:50:17.985+00:00"},"s":"F",  "c":"CONTROL",  "id":4757800, "ctx":"TopologyVersionObserver","msg":"Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31431,   "ctx":"TopologyVersionObserver","msg":"BACKTRACE: {bt}","attr":{"bt":{"backtrace":[{"a":"55606F60F921","b":"55606C92C000","o":"2CE3921","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"},{"a":"55606F610F59","b":"55606C92C000","o":"2CE4F59","s":"_ZN5mongo15printStackTraceEv","s+":"29"},{"a":"55606F60E7B6","b":"55606C92C000","o":"2CE27B6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"},{"a":"7EFF61C048A0","b":"7EFF61BF2000","o":"128A0","s":"funlockfile","s+":"50"},{"a":"7EFF6183FF47","b":"7EFF61801000","o":"3EF47","s":"gsignal","s+":"C7"},{"a":"7EFF618418B1","b":"7EFF61801000","o":"408B1","s":"abort","s+":"141"},{"a":"55606D840D8F","b":"55606C92C000","o":"F14D8F","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"},{"a":"55606D814458","b":"55606C92C000","o":"EE8458","s":"_ZN5mongo6Client19setOperationContextEPNS_16OperationContextE.cold.135","s+":"18"},{"a":"55606F4D1069","b":"55606C92C000","o":"2BA5069","s":"_ZN5mongo14ServiceContext20makeOperationContextEPNS_6ClientE","s+":"129"},{"a":"55606F4C5B17","b":"55606C92C000","o":"2B99B17","s":"_ZN5mongo6Client20makeOperationContextEv","s+":"27"},{"a":"55606DD474A1","b":"55606C92C000","o":"141B4A1","s":"_ZN5mongo4repl23TopologyVersionObserver17_workerThreadBodyEv","s+":"251"},{"a":"55606DD47B4C","b":"55606C92C000","o":"141BB4C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_4repl23TopologyVersionObserver4initEPNS3_14ServiceContextEPNS7_22ReplicationCoordinatorEEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"5C"},{"a":"55606F7B547F","b":"55606C92C000","o":"2E8947F","s":"execute_native_thread_routine","s+":"F"},{"a":"7EFF61BF96DB","b":"7EFF61BF2000","o":"76DB","s":"start_thread","s+":"DB"},{"a":"7EFF61922A3F","b":"7EFF61801000","o":"121A3F","s":"clone","s+":"3F"}],"processInfo":{"mongodbVersion":"4.4.2","gitVersion":"15e73dc5738d2278b688f8929aee605fe4279b0e","compiledModules":[],"uname":{"sysname":"Linux","release":"5.0.0-1036-azure","version":"#38-Ubuntu SMP Sun Mar 22 21:27:21 UTC 2020","machine":"x86_64"},"somap":[{"b":"55606C92C000","elfType":3,"buildId":"D18F657A1E06C333C2AEE534E3047044B0653DBF"},{"b":"7EFF61BF2000","path":"/lib/x86_64-linux-gnu/libpthread.so.0","elfType":3,"buildId":"BC3C06107774266C5F7DB3F1F380A3DA68AF90FA"},{"b":"7EFF61801000","path":"/lib/x86_64-linux-gnu/libc.so.6","elfType":3,"buildId":"D3CF764B2F97AC3EFE366DDD07AD902FB6928FD7"}]}}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F60F921","b":"55606C92C000","o":"2CE3921","s":"_ZN5mongo18stack_trace_detail12_GLOBAL__N_119printStackTraceImplERKNS1_7OptionsEPNS_14StackTraceSinkE.constprop.606","s+":"1E1"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F610F59","b":"55606C92C000","o":"2CE4F59","s":"_ZN5mongo15printStackTraceEv","s+":"29"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F60E7B6","b":"55606C92C000","o":"2CE27B6","s":"_ZN5mongo12_GLOBAL__N_116abruptQuitActionEiP9siginfo_tPv","s+":"66"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61C048A0","b":"7EFF61BF2000","o":"128A0","s":"funlockfile","s+":"50"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF6183FF47","b":"7EFF61801000","o":"3EF47","s":"gsignal","s+":"C7"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF618418B1","b":"7EFF61801000","o":"408B1","s":"abort","s+":"141"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606D840D8F","b":"55606C92C000","o":"F14D8F","s":"_ZN5mongo15invariantFailedEPKcS1_j","s+":"12C"}}}
{"t":{"$date":"2021-01-13T11:50:18.081+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606D814458","b":"55606C92C000","o":"EE8458","s":"_ZN5mongo6Client19setOperationContextEPNS_16OperationContextE.cold.135","s+":"18"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F4D1069","b":"55606C92C000","o":"2BA5069","s":"_ZN5mongo14ServiceContext20makeOperationContextEPNS_6ClientE","s+":"129"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F4C5B17","b":"55606C92C000","o":"2B99B17","s":"_ZN5mongo6Client20makeOperationContextEv","s+":"27"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606DD474A1","b":"55606C92C000","o":"141B4A1","s":"_ZN5mongo4repl23TopologyVersionObserver17_workerThreadBodyEv","s+":"251"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606DD47B4C","b":"55606C92C000","o":"141BB4C","s":"_ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN5mongo4stdx6threadC4IZNS3_4repl23TopologyVersionObserver4initEPNS3_14ServiceContextEPNS7_22ReplicationCoordinatorEEUlvE0_JELi0EEET_DpOT0_EUlvE_EEEEE6_M_runEv","s+":"5C"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"55606F7B547F","b":"55606C92C000","o":"2E8947F","s":"execute_native_thread_routine","s+":"F"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61BF96DB","b":"7EFF61BF2000","o":"76DB","s":"start_thread","s+":"DB"}}}
{"t":{"$date":"2021-01-13T11:50:18.082+00:00"},"s":"I",  "c":"CONTROL",  "id":31427,   "ctx":"TopologyVersionObserver","msg":"  Frame: {frame}","attr":{"frame":{"a":"7EFF61922A3F","b":"7EFF61801000","o":"121A3F","s":"clone","s+":"3F"}}}
 
{"t":{"$date":"2021-01-13T12:17:39.998+00:00"},"s":"I",  "c":"CONTROL",  "id":20698,   "ctx":"main","msg":"***** SERVER RESTARTED *****"}
{"t":{"$date":"2021-01-13T12:17:40.002+00:00"},"s":"I",  "c":"CONTROL",  "id":23285,   "ctx":"main","msg":"Automatically disabling TLS 1.0, to force-enable TLS 1.0 specify --sslDisabledProtocols 'none'"}
{"t":{"$date":"2021-01-13T12:17:40.068+00:00"},"s":"W",  "c":"ASIO",     "id":22601,   "ctx":"main","msg":"No TransportLayer configured during NetworkInterface startup"}

Likely related issues I have already found reported: SERVER-52735 and SERVER-53566 (see Issue Links above).

The relevant part of the mongod.log is attached.

After several minutes the mongod process got killed; after restarting all data-bearing nodes we recovered the cluster. The arbiter was not affected.

The featureCompatibilityVersion was not updated to 4.4.

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } ).featureCompatibilityVersion.version
// 4.2 

More info:

  • Replica set is running 4.4.2 binaries
  • Replica set is composed of (see the rs.conf() sketch below):
    • 3 data-bearing nodes (1 with priority 0 and votes 0, 2 with priority 1 and votes 1)
    • 1 arbiter
  • OS: Linux Ubuntu 18.04 LTS
  • Cloud: Azure
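
A sketch of what the relevant member settings look like in rs.conf() (hostnames, member order and _id values are illustrative; other fields omitted):

// rs.conf().members, trimmed to the fields discussed above
[
  { "_id" : 0, "host" : "node0.example.internal:27017", "priority" : 1, "votes" : 1 },
  { "_id" : 1, "host" : "node1.example.internal:27017", "priority" : 1, "votes" : 1 },
  { "_id" : 2, "host" : "node2.example.internal:27017", "priority" : 0, "votes" : 0 },
  { "_id" : 3, "host" : "arbiter.example.internal:27017", "arbiterOnly" : true, "priority" : 0, "votes" : 1 }
]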

 

What would be the recommendation in this case to overcome the issue and be able to finalize the upgrade?

  • Can we try to run the FCV command on a secondary only first?
  • Is such behavior known or expected after running the FCV command?


 Comments   
Comment by Francesco Rivola [ 20/Jan/21 ]

Hi @Bruce Lucas. That makes sense. I was not aware of this feature. Thank you for pointing it out.

Comment by Bruce Lucas (Inactive) [ 20/Jan/21 ]

francesco.rivola@gmail.com, the behavior you describe could be a result of chained replication, with the getmores you are observing being getmores on the oplog due to replication.
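
For reference, one way to confirm this from the mongo shell is to look at each member's sync source (a rough sketch; the chainingAllowed part is shown only for completeness, not as a recommendation):

// print which member each node is replicating from
rs.status().members.forEach(function (m) {
  print(m.name + " syncs from: " + (m.syncSourceHost || "(none)"));
});

// chaining could be disabled via the replica set settings if ever needed
var cfg = rs.conf();
cfg.settings.chainingAllowed = false;
rs.reconfig(cfg);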

Comment by Francesco Rivola [ 20/Jan/21 ]

Hi @Dmitry Agranat, we are looking forward to any updates on this.

I would also like to provide the following info, which may or may not be related to the above issue.

During the outage we stopped our application. When the members of the replica set crashed, we started them again. RS3 was the first to be started as it crashed first, so it was elected primary. Then, when RS0 and RS1 became healthy again and in sync, we forced RS1 to become the primary member (because that VM has better hardware). Finally we started the application.

Reviewing our cluster metrics (using percona mongodb exporter) we have observed the following:

  • 200-300 getmore and command operations per second started to show up on the RS3 secondary, and around the same number of operations disappeared from the primary RS1 (you can see the light blue and green lines below)
  • Yesterday we tried to investigate what kind of operations these were, but in db.currentOp() we mostly saw the new streamable isMaster operations (see the currentOp sketch after this list). mongostat on the secondary RS3 confirmed the metrics read from the percona mongodb_exporter.
  • Yesterday we stopped and restarted RS3 gracefully. Then those 200-300 ops per second moved back to the primary RS1 (you can see the light blue and green lines below)
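
For reference, this is roughly how we inspected the operations on the RS3 secondary (a sketch; the filters just try to separate oplog getmores and the streamable isMaster traffic from everything else):

// oplog getmores, i.e. other members or clients tailing the oplog on this node
db.currentOp({ "op": "getmore", "ns": "local.oplog.rs" })

// active operations that are not isMaster/hello streams, to spot application traffic
db.currentOp({
  "active": true,
  "command.isMaster": { "$exists": false },
  "command.hello": { "$exists": false }
})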

Do you have any idea what could cause this behavior? Is this expected and documented?

Not sure if it is related to the crash or just an indirect consequence. Hope this helps in any way.

Thank you so much.

Cheers

Comment by Dmitry Agranat [ 18/Jan/21 ]

Hi francesco.rivola@gmail.com, thank you for providing the requested information. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen.

Comment by Francesco Rivola [ 14/Jan/21 ]

Hi @Dmitry Agranat. I have uploaded the zip file SERVER-53747.zip to the suggested location.

The zip file is organized as follows: there is one folder per replica set member (rsXX), and each rsXX folder has three tar.gz files containing

  • diagnostic.data folder
  • mongod.log limited to the full day of the issue
  • core dump

Let me know if you need any further information or data.

Looking forward to knowing more about the root cause of this issue.

Currently we have put the FCV change to 4.4 on hold in our production cluster.

Cheers

 

Comment by Dmitry Agranat [ 14/Jan/21 ]

Hi francesco.rivola@gmail.com, thank you for the provided information.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) for all members in this replica set and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Could you also upload the core dump into the same secure uploader? Since it is large (3GB), it might take some time to upload, but alternatively you can split it into 10 smaller files with the

split -n 10

command.

Thanks,
Dima

Comment by Francesco Rivola [ 14/Jan/21 ]

Hi @Dmitry Agranat,

OS 5.0.0-1036-azure is the Ubuntu kernel shipped by Azure https://packages.ubuntu.com/bionic/linux-image-5.0.0-1036-azure.

Yes, we have a core dump of the crash. Its size is 2912MB. This is from the primary:

The arbiter has FCV 4.4.

Comment by Dmitry Agranat [ 13/Jan/21 ]

Hi francesco.rivola@gmail.com, thank you for the report. Do you happen to have core dumps enabled on the server that ran into the Invariant failure? Could you also clarify what the OS 5.0.0-1036-azure stands for?
