[SERVER-61723] Two primary nodes found in one replica set. Created: 24/Nov/21 Updated: 10/Jun/22 Resolved: 22/Dec/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.2.17 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Zijun Tian | Assignee: | Edwin Zhou |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
We have a replica set of 5 nodes. According to the log, the primary was handling some aggregations that drove CPU utilization to 100% within a short time. After about 10 seconds with no response from the primary, an election started and a new primary was elected. The election result was sent to the previous primary, and its log showed that it stepped down and changed its state to secondary. However, for an unknown reason the state did not actually change: running `rs.status()` on any node in the cluster showed two primary nodes at the same time (although the other 3 secondary nodes were syncing from the new primary). As a result, some users connecting to the cluster with PyMongo ran into connection issues while others did not; I suspect those users were connecting to the wrong (previous) primary. We tried removing the previous primary and adding it back, but there were still two primaries. We had to reboot the previous primary and add it back to the cluster; this time it entered the ROLLBACK state and, after several minutes, became a secondary. |
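For illustration only, a minimal PyMongo sketch (host names, port, and timeout are placeholders, not taken from this ticket) of how the split view could be confirmed by asking each member directly for its own replica set status:

```python
# Hedged sketch: poll each member directly and report which ones believe
# they are PRIMARY. Host names are placeholders; assumes a PyMongo version
# that supports directConnection.
from pymongo import MongoClient

MEMBERS = ["node1:27017", "node2:27017", "node3:27017",
           "node4:27017", "node5:27017"]  # hypothetical hosts

for host in MEMBERS:
    # directConnection=True skips topology discovery, so we see exactly
    # what this one node reports about itself.
    client = MongoClient(host, directConnection=True,
                         serverSelectionTimeoutMS=5000)
    try:
        status = client.admin.command("replSetGetStatus")
        me = next(m for m in status["members"] if m.get("self"))
        print(f"{host}: {me['stateStr']} (term {status.get('term')})")
    except Exception as exc:
        print(f"{host}: unreachable ({exc})")
    finally:
        client.close()
```

If two hosts print PRIMARY at the same time, the cluster is in the state described above.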
| Comments |
| Comment by Edwin Zhou [ 22/Dec/21 ] | |
|
I haven't heard from you in a while so I will now close this ticket. Please let us know if you run into this issue again after attempting to profile the database on the latest version of MongoDB v4.2 and are able to collect stack traces. Best, | |
| Comment by Edwin Zhou [ 06/Dec/21 ] | |
|
Thank you for following up and for your effort in this investigation. Knowing that profiling was active during this time heavily narrows this behavior down to a known issue. Best, |
| Comment by Zijun Tian [ 02/Dec/21 ] | |
|
Yes, we set the profiling level to 2 at the time, and I will collect the stack traces if it happens again. Thanks! |
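For context, a small sketch, assuming PyMongo and placeholder connection details, of how profiling level 2 is typically enabled and checked via the profile command:

```python
# Hedged sketch: enable and verify profiling level 2 on one database.
# Connection string and database name are placeholders (the replica set
# name TSMongo comes from the rs.status() prompt in this ticket).
from pymongo import MongoClient

client = MongoClient("mongodb://node1:27017/?replicaSet=TSMongo")
db = client["mydatabase"]  # hypothetical database name

# profile: 2 records all operations; profile: -1 reads the current
# settings without changing them.
db.command("profile", 2)
current = db.command("profile", -1)
print("profiling level:", current["was"], "slowms:", current.get("slowms"))
```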
| Comment by Edwin Zhou [ 02/Dec/21 ] | |
|
In my initial look at this diagnostic data, we are indeed seeing two nodes appearing as primaries. There appears to be contention on the ReplicationStateTransitionLock on the node that is stepping down. It's possible that this is an occurrence of a known issue. Can you confirm whether you were profiling the database at this time? There may be a different behavior that exhibits this issue, but we need stack traces to confirm that this is the case. If you are able to reproduce this behavior, or if it happens again, can you collect stack traces on the stalled node, that is, the primary that is stalled and unable to step down?
Best, | |
| Comment by Edwin Zhou [ 30/Nov/21 ] | |
|
Thank you again for following up. I'm confirming that I've received your diagnostic data files. Best, | |
| Comment by Zijun Tian [ 30/Nov/21 ] | |
|
Hi Edwin, It seems I only replaced the filename once. I have now replaced both instances and uploaded the files again. Please have a look, thanks! |
| Comment by Edwin Zhou [ 30/Nov/21 ] | |
|
I apologize for the inconvenience; however, the data still isn't in the portal. Can you confirm that you're replacing the two instances of <filename> in the curl command? One instance is in -F attributes and the second instance is in -F file. Thanks for your patience. Best, | |
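Purely as an illustration of the two placements described above, a rough Python-requests equivalent; the portal URL is a placeholder and the exact shape of the attributes field is an assumption, so the actual command template should be taken from the upload portal page itself:

```python
# Hedged sketch: the same file name must appear in both form fields,
# mirroring the two <filename> instances in the curl command.
# UPLOAD_URL and the attributes payload are assumptions, not the real portal API.
import requests

UPLOAD_URL = "https://<support-upload-portal>"  # placeholder
FILENAME = "mongo.zip"

with open(FILENAME, "rb") as fh:
    resp = requests.post(
        UPLOAD_URL,
        data={"attributes": FILENAME},   # first instance of <filename>
        files={"file": (FILENAME, fh)},  # second instance of <filename>
    )
resp.raise_for_status()
```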
| Comment by Zijun Tian [ 30/Nov/21 ] | |
|
Hi Edwin, I uploaded it again; it's a zip file called "mongo.zip", 167 MB in size. Can you confirm? Thanks! |
| Comment by Edwin Zhou [ 30/Nov/21 ] | |
|
Thank you for following up. Unfortunately I don't see that the diagnostic data has been uploaded to the upload portal. Can you try again and confirm? Best, | |
| Comment by Zijun Tian [ 30/Nov/21 ] | |
|
Hi Edwin, I've uploaded them. | |
| Comment by Edwin Zhou [ 29/Nov/21 ] | |
|
Thank you for following up and I apologize for the confusion. The metrics files that you see are indeed the ftdc files I previously referred to. You may upload all of the log files and diagnostic.data directory for every node. Best, | |
| Comment by Zijun Tian [ 29/Nov/21 ] | |
|
Hi Edwin, I did not see any ftdc files under the diagnostic.data folder. The only file missing is "metrics.interim". Would it be OK if I just upload all of the logs and diagnostic.data for the 5 nodes of the replica set? |
| Comment by Edwin Zhou [ 29/Nov/21 ] | |
|
Thank you for uploading the mongod log file and the timestamp of the incident; however, it's unclear which node this log file is associated with. Can you please also upload the log files from the remaining nodes and the FTDC data from the $dbpath/diagnostic.data directory (the contents are described here) for the entire replica set? Best, |
| Comment by Zijun Tian [ 29/Nov/21 ] | |
|
Hi @Edwin Zhou, I uploaded the files as a zip file "mongo.zip". The incident happened on Nov 23 at 21:21:00 (UTC). |
| Comment by Edwin Zhou [ 29/Nov/21 ] | |
|
Would you please archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location? Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Best, | |
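A sketch of archiving the requested files with the Python standard library; the dbpath and log path below are assumptions and should be replaced with the values from your own mongod configuration:

```python
# Hedged sketch: bundle mongod.log and the diagnostic.data directory
# into a single archive for upload. Paths are placeholders.
import tarfile

DBPATH = "/var/lib/mongo"                 # hypothetical dbpath
LOGFILE = "/var/log/mongodb/mongod.log"   # hypothetical logpath

with tarfile.open("mongo-diagnostics.tar.gz", "w:gz") as tar:
    tar.add(f"{DBPATH}/diagnostic.data", arcname="diagnostic.data")
    tar.add(LOGFILE, arcname="mongod.log")
```

Repeating this on each member of the replica set (with distinct archive names per node) covers the full set of files requested above.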
| Comment by Zijun Tian [ 24/Nov/21 ] | |
|
TSMongo:PRIMARY> rs.status()
(output truncated in this export; surviving fragment: "my last applied OpTime: { ts: Timestamp(1634249503, 26), t: 319 }")