[SERVER-33350] Primary switches to Secondary because a new term has begun Created: 15/Feb/18 Updated: 27/Oct/23 Resolved: 16/Jun/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.18 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Darshan Shah | Assignee: | Arnie Listhaus |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Operating System: | ALL |
| Participants: | |
| Description |
After upgrading to 3.2.18 (from 3.2.11) on RHEL 7.1, we intermittently see the Primary stepping down to Secondary because a new term has begun.
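For context, elections of this kind are governed by the replica set's election timeout under protocolVersion 1. A minimal mongo shell sketch, assuming a 3.2 replica set and purely illustrative values (the 20-second timeout below is not taken from this cluster), showing how the setting could be inspected and raised to make the set less sensitive to brief heartbeat loss:

```javascript
// Sketch: inspect and (optionally) raise the election timeout on a 3.2
// replica set running protocolVersion 1. Run on the current primary.
var conf = rs.conf();
printjson(conf.settings);                     // shows electionTimeoutMillis, if set

// Assumption for illustration: raise the timeout from the 10s default to 20s
// so brief heartbeat loss is less likely to trigger a new term.
conf.settings = conf.settings || {};
conf.settings.electionTimeoutMillis = 20000;
rs.reconfig(conf);
```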
| Comments |
| Comment by Kelsey Schubert [ 16/Jun/18 ] |

Hi darshan.shah@interactivedata.com,

It appears that these nodes were still running MongoDB 3.2.11; as mentioned earlier, we strongly recommend upgrading to a more recent version.

An election occurred because the secondary could not communicate with the current primary, which was suffering significant performance degradation. Following the election, the old primary recognized that a new primary had been elected and stepped down as expected. In this case the failover was expected, and I don't see anything to indicate a bug in this specific behavior.

For MongoDB-related support discussion, please post on the mongodb-user group or Stack Overflow with the mongodb tag; a question like this, which involves further discussion, is best suited to that forum.

Regards,
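A minimal mongo shell sketch of how the behavior described above could be confirmed after such a failover, assuming a 3.2 replica set using protocolVersion 1 (which reports the current term in replSetGetStatus); the output formatting is illustrative only:

```javascript
// Sketch: confirm the current term and which member is now primary.
var s = rs.status();
print("replica set term: " + s.term);         // reported by pv1 in 3.2+

s.members.forEach(function (m) {
    // stateStr is PRIMARY/SECONDARY/...; electionDate is set on the primary.
    print(m.name + "  " + m.stateStr +
          (m.electionDate ? "  elected at " + m.electionDate : ""));
});
```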
| Comment by Darshan Shah [ 17/May/18 ] |

Observed this today: even though the original primary is caught up, it is not stepping up to become primary again.

On the original Primary that stepped down:

On the Secondary which transitioned to Primary:
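One possible explanation, sketched below in the mongo shell, is that with equal member priorities protocolVersion 1 does not fail back automatically once a new primary is established; the priority value and the stepDown duration below are illustrative assumptions, not this cluster's configuration:

```javascript
// Sketch: check member priorities; an old primary with the same priority as
// the new one will not take over again on its own once it catches up.
rs.conf().members.forEach(function (m) {
    print(m.host + "  priority=" + m.priority);
});

// Illustrative option 1: step down the current primary for 60 seconds so an
// up-to-date secondary (e.g. the original primary) can win an election.
// rs.stepDown(60);

// Illustrative option 2: give the preferred node a higher priority so it
// performs a priority takeover once it is caught up.
// var c = rs.conf(); c.members[0].priority = 2; rs.reconfig(c);
```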
| Comment by Darshan Shah [ 12/May/18 ] |

I have uploaded the requested files for the 3 shards where we saw this problem yesterday.

Another issue we noticed is that mongos still thinks the original primary is the one serving the data, which causes the application to stall/fail because it gets no results for its queries in this situation. I have uploaded the mongos logs as well. So rather than the primary switching, the real problem is that mongos does not know how to handle this situation.

Please check and let me know if you need any other info to look into this problem.

Thanks,
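A hedged mongo shell sketch, run against a mongos, of one way to check which host the router currently believes is primary for each shard; the connPoolStats fields shown follow typical 3.2-era output, and the commented flushRouterConfig call only refreshes cached sharding metadata, so restarting the mongos may still be needed if its replica set view is stale:

```javascript
// Sketch (run on a mongos): list which host each shard's replica set monitor
// currently reports as master. Field names follow typical 3.2 connPoolStats.
var stats = db.adminCommand({ connPoolStats: 1 });
Object.keys(stats.replicaSets || {}).forEach(function (setName) {
    stats.replicaSets[setName].hosts.forEach(function (h) {
        print(setName + "  " + h.addr + "  ismaster=" + h.ismaster);
    });
});

// Illustrative follow-up: flush the router's cached sharding metadata. This
// does not by itself re-discover a new primary, so a mongos restart may
// still be required if the view printed above is stale.
// db.adminCommand({ flushRouterConfig: 1 });
```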
| Comment by Darshan Shah [ 11/May/18 ] |

I am working on getting all the required logs and info, as this problem occurred again last night.

Also, can you please explain what you mean by your comment:

Thanks,
| Comment by Arnie Listhaus [ 27/Apr/18 ] |

Hi Darshan,

Thank you for attaching the logs and diagnostic data. Here is what I see based on what you have collected.

Actual Primary Log

Backup Primary Log

Diagnostic Data

Although we will need consistent logs and diagnostic data to do a root cause analysis, I did want to call your attention to the fact that your servers are showing signs of stalling, as can be seen in the chart below:

As you can see, at 18:13 UTC (14:13 EDT) on ...-17, there was a significant surge in documents returned from queries, and at the peak there was a 50-second stall. When the server came back, it had no memory allocated, which indicates that this server likely crashed and was restarted. The log you provided for c4z-pt-pdoas-17 starts at 2018-03-21T14:16:38.221-0400, which is consistent with the chart below:

This shows that ...-17 had an initial failure at 18:13 UTC and then failed again at 18:15:56 UTC. Unfortunately, I do not have those logs and therefore cannot determine the cause of the failures.

If you would like help in evaluating these issues further, please provide a consistent set of logs consisting of:

Also, if you still have log files for ...-17 covering the period of the issues reported above, please review and/or provide those so we can see if there are any indications as to why that server may have crashed.

Finally, 3.2.11 is actually quite old at this point, and there have been many bug fixes/improvements made in later versions of 3.2 as well as in 3.4 and 3.6. I would strongly encourage you to upgrade to at least the latest version of 3.2, which is currently 3.2.19.

Thanks,
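A small mongo shell sketch of checks that could corroborate the restart and version observations above on each affected mongod; the field names are standard serverStatus output, and treating a short uptime as evidence of a recent restart is only a heuristic:

```javascript
// Sketch: quick checks on each suspect mongod to corroborate a recent
// restart and to confirm which binary version is actually running.
print("version: " + db.version());            // e.g. 3.2.11 vs 3.2.18
var ss = db.serverStatus();
print("uptime (seconds): " + ss.uptime);      // a small value suggests a recent restart
print("local time: " + ss.localTime);

// A nearly empty WiredTiger cache shortly after start is consistent with the
// "no memory allocated" observation in the diagnostic data.
if (ss.wiredTiger) {
    print("cache bytes in use: " +
          ss.wiredTiger.cache["bytes currently in the cache"]);
}
```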
| Comment by Darshan Shah [ 10/Apr/18 ] |

Hi Kelsey,

I have uploaded a tar zip file to the secure link you provided.

Thanks,
| Comment by Darshan Shah [ 22/Mar/18 ] |

Hi Kelsey,

We now see this happening occasionally in a cluster running MongoDB 3.2.11 as well, along with a Mongos Invariant Failure.

Is it possible for you to download the files from a secure link that I provide, instead of me uploading the files to your secure link on Amazon? Due to security restrictions, I cannot access the secure link you provided, nor can I upload to this JIRA.

Thanks,
| Comment by Kelsey Schubert [ 20/Mar/18 ] |

Hi darshan.shah@interactivedata.com,

Thanks for providing these snippets. However, for us to continue to investigate, we'll also require the diagnostic.data from both nodes. Would you please upload an archive of these directories to the portal I provided?

Thank you,
| Comment by Darshan Shah [ 19/Mar/18 ] |

Hi Kelsey,

I have attached the anonymized log snippets from the time this happened.

Thanks,
| Comment by Darshan Shah [ 27/Feb/18 ] |

Hi,

I am working on getting the required info. FYI: to confirm whether this was related to the upgrade to version 3.2.18, we downgraded back to 3.2.11, and the problem has not recurred since.

Thanks,
| Comment by Kelsey Schubert [ 16/Feb/18 ] |

Hi darshan.shah@interactivedata.com,

So we can investigate this issue, would you please upload an archive of diagnostic.data and the complete logs for each member of the affected replica set? I've created a secure portal for you to provide these files.

Thank you,