[SERVER-53747] replica set down after setFeatureCompatibilityVersion 4.4 Created: 13/Jan/21 Updated: 06/Dec/22 Resolved: 28/Jan/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.2 |
| Fix Version/s: | None |
| Type: | Question | Priority: | Major - P3 |
| Reporter: | Francesco Rivola | Assignee: | Backlog - Service Architecture |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Assigned Teams: | Service Arch |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
Today we experienced a major issue in our MongoDB production database. Three weeks ago we upgraded the binaries from 4.2 to 4.4 on all nodes of the replica set. Today, in order to complete the upgrade, we ran the following command from the primary (as described in the official documentation https://docs.mongodb.com/manual/release-notes/4.4-upgrade-replica-set/#feature-compatibility-version): {{db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )}} Right after executing the command, all data-bearing nodes of our replica set became unhealthy and CPU spiked to 100% on all of them.
Any attempt to establish a connection to the replica set ended with a timeout or a rejection. The mongod log shows this fatal error:
Likely related issues that I found already reported are:
Attached is the relevant part of the mongod.log. After several minutes the mongod process got killed; after restarting all data-bearing nodes we recovered the cluster. The arbiter was not affected. The featureCompatibilityVersion was not updated to 4.4.
More info:
What is the recommended way in this case to overcome the issue and be able to finalize the upgrade?
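For reference, this is roughly how the FCV can be checked and how the finalizing command is issued from the primary; a minimal mongo shell sketch based on the documented commands (output formats vary by version):
{code:javascript}
// Check the current feature compatibility version on the primary.
// In our case this still reports the 4.2 value after the crash.
db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )

// The command that triggered the incident, run from the primary to
// finalize the upgrade once the replica set is healthy again:
db.adminCommand( { setFeatureCompatibilityVersion: "4.4" } )
{code}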
|
| Comments |
| Comment by Francesco Rivola [ 20/Jan/21 ] | |
|
Hi @Bruce Lucas. That makes sense. I was not aware of this feature. Thank you for pointing it out. | |
| Comment by Bruce Lucas (Inactive) [ 20/Jan/21 ] | |
|
francesco.rivola@gmail.com, the behavior you describe could be a result of chained replication, with the getmores you are observing being getmores on the oplog due to replication.
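For reference, chained replication is controlled by the settings.chainingAllowed field of the replica set configuration; a minimal mongo shell sketch of how it can be inspected and, if desired, disabled (not a recommendation for this ticket):
{code:javascript}
// Run against the primary. chainingAllowed defaults to true, which
// allows secondaries to sync from other secondaries.
var cfg = rs.conf();
printjson(cfg.settings);              // shows chainingAllowed if it is set

// To make secondaries sync directly from the primary instead:
cfg.settings = cfg.settings || {};
cfg.settings.chainingAllowed = false;
rs.reconfig(cfg);
{code}
| |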
| Comment by Francesco Rivola [ 20/Jan/21 ] | |
|
Hi @Dmitry Agranat, we are looking forward to any updates on this. I would also like to provide the following information, which may or may not be related to the above issue. During the outage we stopped our application; when the members of the replica set crashed, we started them again. RS3 was the first to be restarted, as it had crashed first, so it was elected primary. Then, when RS0 and RS1 became healthy again and in sync, we forced RS1 to become the primary member (because this VM has better hardware; one way to do this is sketched at the end of this comment). Finally we started the application again. Reviewing our cluster metrics (using the Percona MongoDB exporter) we have observed the following:
Do you have any idea what could cause this behavior? Is this expected and documented? I am not sure whether it is related to the crash or just an indirect consequence. Hope this helps in any way. Thank you so much. Cheers
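For reference, one way to make a specific member the preferred primary is to give it a higher priority via rs.reconfig(); a minimal mongo shell sketch (the member index used for RS1 is an assumption, check rs.conf() for the actual one):
{code:javascript}
// Run against the current primary. The default priority is 1; the
// healthy member with the highest priority is preferred in elections.
var cfg = rs.conf();

// Assumption: members[1] is the RS1 host with the better hardware.
cfg.members[1].priority = 2;

// After the reconfig, an election makes RS1 primary once it has caught up.
rs.reconfig(cfg);
{code}
| |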
| Comment by Dmitry Agranat [ 18/Jan/21 ] | |
|
Hi francesco.rivola@gmail.com, thank you for providing the requested information. We're assigning this ticket to the appropriate team to be evaluated against our currently planned work. Updates will be posted on this ticket as they happen. | |
| Comment by Francesco Rivola [ 14/Jan/21 ] | |
|
Hi @Dmitry Agranat. I have uploaded the zip file.
The zip file is organized as follows:
Where each rsXX folder has three tar.gz files.
Let me know if you need any further information or data. Looking forward to knowing more about the root cause of this issue. For now we have put the FCV update to 4.4 on hold in our production cluster. Cheers
| |
| Comment by Dmitry Agranat [ 14/Jan/21 ] | |
|
Hi francesco.rivola@gmail.com, thank you for the provided information. Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) for all members in this replica set and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Could you also upload the core dump into the same secure uploader? Since it is large (3 GB), it might take some time to upload, but alternatively you can split it into 10 smaller files with
Thanks, | |
| Comment by Francesco Rivola [ 14/Jan/21 ] | |
|
Hi @Dmitry Agranat, OS 5.0.0-1036-azure is the Ubuntu kernel shipped by Azure: https://packages.ubuntu.com/bionic/linux-image-5.0.0-1036-azure. Yes, we have a core dump of the crash; its size is 2912 MB. This is the primary:
The arbiter has FCV set to 4.4.
| |
| Comment by Dmitry Agranat [ 13/Jan/21 ] | |
|
Hi francesco.rivola@gmail.com, thank you for the report. Do you happen to have core dumps enabled on the server that ran into the Invariant failure? Could you also clarify what the OS 5.0.0-1036-azure stands for?