[SERVER-61803] Multiple primary nodes in MongoDB replica set Created: 27/Nov/21  Updated: 05/Mar/22  Resolved: 03/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: sadegh raz Assignee: Edwin Zhou
Resolution: Done Votes: 0
Labels: mongodb, replica-set
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 18.04.4 LTS


Participants:

 Description   

Problem Statement/Rationale

Sometimes our MongoDB replica set has multiple primary nodes. In this case, mongo-b is our real primary, while mongo-c is not responding and only reports the primary role spuriously.

Number of nodes in the replica set: 4 (3 normal + 1 delayed node without voting rights). We checked for a network partition and verified access between the nodes; there is no partition. We have had this problem several times since upgrading the cluster from version 4.2.5 to version 4.4.5. Each time we fix the problem by wiping the data directory on the stale node and restarting it.
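For reference, the condition can be confirmed from any member by listing which members rs.status() reports as PRIMARY; a healthy replica set returns exactly one. This is a minimal mongo shell sketch using only standard fields (stateStr and configTerm both appear in the output below):

    // List every member currently reported as PRIMARY, together with its election term.
    // A healthy replica set prints exactly one line.
    rs.status().members
        .filter(function (m) { return m.stateStr === "PRIMARY"; })
        .forEach(function (m) { print(m.name + "  term=" + m.configTerm); });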

Steps to Reproduce

It happens from time to time; there is no simple way to reproduce it.

Expected Results

The cluster should have only one primary, or at least produce clearer logs about the issue.

Actual Results

            {
                    "_id" : 5,                                       
                    "name" : "mongo-c:27017",                     
                    "health" : 1,
                    "state" : 1,
                    "stateStr" : "PRIMARY",
                    "uptime" : 3435,
                    "optime" : {
                            "ts" : Timestamp(1637525581, 2),
                            "t" : NumberLong(87)
                    },
                    "optimeDurable" : {
                            "ts" : Timestamp(1637525443, 1),
                            "t" : NumberLong(87)
                    },
                    "optimeDate" : ISODate("2021-11-21T20:13:01Z"),
                    "optimeDurableDate" : ISODate("2021-11-21T20:10:43Z"),
                    "lastHeartbeat" : ISODate("2021-11-24T07:08:32.108Z"),
                    "lastHeartbeatRecv" : ISODate("2021-11-24T07:08:34.867Z"),
                    "pingMs" : NumberLong(2355),
                    "lastHeartbeatMessage" : "",
                    "syncSourceHost" : "",
                    "syncSourceId" : -1,
                    "infoMessage" : "",
                    "electionTime" : Timestamp(1637524307, 1),
                    "electionDate" : ISODate("2021-11-21T19:51:47Z"),
                    "configVersion" : 23,
                    "configTerm" : 87
            },
            {
                    "_id" : 6,
                    "name" : "mongo-b:27017",
                    "health" : 1,
                    "state" : 1,
                    "stateStr" : "PRIMARY",
                    "uptime" : 7995003,
                    "optime" : {
                            "ts" : Timestamp(1637737714, 24),
                            "t" : NumberLong(88)
                    },
                    "optimeDate" : ISODate("2021-11-24T07:08:34Z"),
                    "syncSourceHost" : "",
                    "syncSourceId" : -1,
                    "infoMessage" : "",
                    "electionTime" : Timestamp(1637525581, 1),
                    "electionDate" : ISODate("2021-11-21T20:13:01Z"),
                    "configVersion" : 23,
                    "configTerm" : 88,
                    "self" : true,
                    "lastHeartbeatMessage" : ""
            }

Additional Notes

replica conf:

            {
                    "_id" : 0,                                                                                                          
                    "host" : "mongo-a:27017",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 1,
                    "tags" : {
 
                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 3,
                    "host" : "mongo-h:27017",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : true,
                    "priority" : 0,
                    "tags" : {
 
                    },
                    "slaveDelay" : NumberLong(900),
                    "votes" : 0
            },
            {
                    "_id" : 5,
                    "host" : "mongo-c:27017",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 1,
                    "tags" : {
 
                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            },
            {
                    "_id" : 6,
                    "host" : "mongo-b:27017",
                    "arbiterOnly" : false,
                    "buildIndexes" : true,
                    "hidden" : false,
                    "priority" : 1,
                    "tags" : {
 
                    },
                    "slaveDelay" : NumberLong(0),
                    "votes" : 1
            }
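With this configuration there are three voting members (mongo-a, mongo-c, mongo-b), so an election needs two votes to succeed; two members reporting PRIMARY in different terms (87 vs. 88 above) therefore looks like a stale primary that failed to step down rather than a genuine split election. A small mongo shell sketch to summarize the voting layout (rs.conf() and these member fields are standard in 4.4; slaveDelay was renamed secondaryDelaySecs in 5.0):

    // Print each member's voting weight, priority, hidden flag and delay
    rs.conf().members.forEach(function (m) {
        print(m.host + "  votes=" + m.votes + "  priority=" + m.priority +
              "  hidden=" + m.hidden + "  slaveDelay=" + m.slaveDelay);
    });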



 Comments   
Comment by sadegh raz [ 05/Mar/22 ]

Hi Edwin. Two weeks ago we updated the cluster to the latest version, as you suggested. Now I'm waiting to see whether the issue persists or not. Thank you.


Comment by Edwin Zhou [ 03/Mar/22 ]

Hi razzaghisaa@gmail.com,

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 22/Feb/22 ]

Hi razzaghisaa@gmail.com,

We still need additional information to diagnose the problem. If this incident occurs again, can you supply the following data from both the former/stalled and current primary, ideally at the time when a new primary is elected?

  • FTDC from the current primary covering the time that the primary steps up
  • FTDC from the former/stalled primary covering the time that the current primary steps up
  • mongod logs from the current primary covering the time that the primary steps up
  • mongod logs from the former/stalled primary covering the time that the current primary steps up
  • gdb stack traces from the former/stalled primary, collected with the command I listed above, along with your local timezone

Best,
Edwin

Comment by Edwin Zhou [ 10/Feb/22 ]

Hi razzaghisaa@gmail.com,

Thank you for following up. We look forward to analyzing the incident when you attach the diagnostics. Here is a new link to the support uploader as the previous one has expired.

Best,
Edwin

Comment by sadegh raz [ 02/Feb/22 ]

Hi guys,

It happened again, even after updating to 4.4.10.

Profiling is enabled on all nodes.

I will send new statistics the next time it happens.

Comment by Eric Sedor [ 31/Jan/22 ]

Hi razzaghisaa@gmail.com, are you able to provide the information my colleague Edwin requested?

Comment by Edwin Zhou [ 07/Jan/22 ]

Hi razzaghisaa@gmail.com,

Thank you for uploading the gdb stacks and diagnostic data. Unfortunately, the diagnostic data, logs, and gdb stack traces do not coincide within the same timeline, so it's unclear when the failover happened that left a node incorrectly stuck in a primary state. Additionally, we want all of the information you provided for the current primary from the former/stalled primary as well.

Here's the data you have provided so far:

  • FTDC from node _id 5 (current primary) from 2021-12-28T02:27:06.000Z to 2021-12-28T08:42:05.000Z (UTC)
  • mongod logs from node _id 5 from 2021-11-24T02:59:06.987Z to 2021-11-24T07:09:44.135Z (UTC)
  • gdb stack traces from 2021-12-11T11:16:45 (unknown timezone and node)

If this incident occurs again, can you supply the following data from both the former/stalled and current primary, ideally at the time when a new primary is elected?

  • FTDC from the current primary covering the time that the primary steps up
  • FTDC from the former/stalled primary covering the time that the current primary steps up
  • mongod logs from the current primary covering the time that the primary steps up
  • mongod logs from the former/stalled primary covering the time that the current primary steps up
  • gdb stack traces from the former/stalled primary, collected with the command I listed above, along with your local timezone

I've also noticed that this repeated occurrence happened on MongoDB v4.4.5, which is exposed to SERVER-59226, a bug that causes contention on the RSTL lock during database profiling. I recommend you upgrade to MongoDB v4.4.11, which fixes SERVER-59226.

Can you confirm whether you were profiling the database at the time of the stalled former primary?

Best,
Edwin

Comment by sadegh raz [ 28/Dec/21 ]

Hi @edwin.zhou

The problem recurred, so I uploaded the diagnostic logs as well.

Regards.

Comment by sadegh raz [ 26/Dec/21 ]

Hi Edwin,

Sorry for the late answer; we needed some time to prepare the logs.

I uploaded the logs and gdb files.

Unfortunately, diagnostic.data was deleted because of the retention policy, but I can send it right after the next time the issue happens again.

Thanks.

Comment by Edwin Zhou [ 22/Dec/21 ]

Hi razzaghisaa@gmail.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please let us know whether you've encountered this issue after upgrading to the latest version of 4.4, and if there has been a repeat occurrence, can you please provide stack traces from the stalled primary?

Best,
Edwin

Comment by Edwin Zhou [ 06/Dec/21 ]

Hi razzaghisaa@gmail.com,

Thank you for your report. I'd first like to note that MongoDB version 4.4.5 is not recommended for production use due to critical issues and highly advise that you upgrade to the latest version of MongoDB v4.4.x and perform validate() on all collections.

Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

I suspect that this issue may be an occurrence of SERVER-59226, which causes contention on the RSTL lock during database profiling. That issue is fixed in 4.2.18 and 4.4.11, which would explain why upgrading to 4.4.5 did not resolve this behavior.
Can you confirm whether you were profiling the database at this time? There may be a different behavior that exhibits this issue, but we need stack traces to confirm that this is the case.
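A quick way to check is db.getProfilingStatus(), run against the relevant database on each node (a minimal mongo shell sketch; a "was" level above 0 means profiling is enabled for that database):

    // Run on each node against the database in question; "was" > 0 means profiling is on
    db.getProfilingStatus()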

If this behavior happens again, can you collect stack traces on the stalled nodes? The particular node is the primary that is stalled and unable to step down.

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt

Best,
Edwin
