[SERVER-48962] Member in replica set switches between primary and secondary every few minutes Created: 18/Jun/20 Updated: 20/Jul/20 Resolved: 20/Jul/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 4.0.18, 4.0.19 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Vladimir Dimitrov | Assignee: | Dmitry Agranat |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We have a Primary-Secondary-Arbiter setup with version 4.0.18 (upgraded to 4.0.19 but still the problem persists) which started experiencing some very strange behavior. The three nodes are called Osiril, Horus and Isis (Arbiter). Every few minutes (from 2 to 10 min looking at the log) we get this in the log:
And this repeats forever. It looks like some timer is killing the connection with the other replica and this causes the switch between primary and secondary but we have no way to diagnose this. This setup has been running for about 10 years now. The only thing we have changed is to move the virtual machine Horus to another server hardware which is faster. This problem causes all clients to break the connection every few minutes and in some cases crashes some services. Attached are the log files from all servers.
|
| Comments |
| Comment by Vladimir Dimitrov [ 20/Jul/20 ] |
|
Hi Dmitry, I doubt the issue somehow was magically resolved, but can't provide you with logs rom the past that were automatically cleared by MongoDB de to it's default behavior to do so. If you are unable to see the issue or do anything about it using the logs I have already provided please fell free to close the issue. |
| Comment by Dmitry Agranat [ 20/Jul/20 ] |
|
We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, |
| Comment by Dmitry Agranat [ 07/Jul/20 ] |
|
Since we do not have mongod logs covering the time of this event, we recommend implementing log archiving. Once we'll be able to review the logs covering the time of this event, we'll be able to progress with diagnosing this issue. Thanks, |
| Comment by Vladimir Dimitrov [ 26/Jun/20 ] |
|
Hi Carl. Looks like MongoDB overwrites the log file every time it is restarted. I don't see any other log files that I can send you. |
| Comment by Carl Champain (Inactive) [ 25/Jun/20 ] |
|
The logs for "Osiris" start at 2020-06-15T14:51:38.334+0300, whereas the logs for "Isis" start at 2020-06-15T21:39:05.013+0300 and the logs for "Horus" start at 2020-06-15T18:53:36.852+0300. Can you please provide the logs for "Isis" and "Horus" starting at around 2020-06-15T14:51:38.334+0300 (same time as "Osiris")? Thank you, |
| Comment by Vladimir Dimitrov [ 23/Jun/20 ] |
|
I am not sure what caused the problem and what fixed it. The actual problem was that MongoDB was doing something every minute and on an even minute interval was killing it's own connection with the other replica. This was causing a switch in the PRIMARY/SECONDARY state every 1 or 2 or 2 ... or 10 minutes which was resulting in all client connections being killed every time the switch happened. I have no idea what exacly was causing this problem but even if the time was no exactly the same on both replica machines this should not be an expected behavior. |
| Comment by Carl Champain (Inactive) [ 23/Jun/20 ] |
|
Thank you for the additional information, vlad.dimitrov@gmail.com! From your previous comment it seems that the issue came from outside of MongoDB. Can you please help clarify what you think the bug is with MongoDB? |
| Comment by Vladimir Dimitrov [ 22/Jun/20 ] |
|
Hi Carl. Attached is the logs from one of the server limited to one day since these files even compressed became about 200 MB.
What I can add is that I somehow managed to stop this from happening. The database is actually a VM running on ESXi host. As I stated in the description this is a new host that I moved the VM to. For some reason the ESXi had a completely messed up timestamp (one day back) and I changed the NTP servers that ESXi used which fixed the time and this issue. Note that the VM actually had correct time. I saw that there is one line repeated hundreds of times:
Which led me to look at the times everywhere. Not sure how this change fixed the situation or if it is now fixed, but it looks stable at the moment. Let me know if you need more information to diagnose the issue. As of the moment I have transitioned the replica set to a primary, secondary, secondary but the logs are not cleared. |
| Comment by Carl Champain (Inactive) [ 22/Jun/20 ] |
|
Thank you for the report. Kind regards, |