[SERVER-21049] Primary experiences CPU spike under load leading to automatic primary switch Created: 21/Oct/15 Updated: 22/Oct/15 Resolved: 22/Oct/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | MMAPv1 |
| Affects Version/s: | 3.0.4 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Arjun Taneja | Assignee: | Unassigned |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Steps To Reproduce: | Oddly there was nothing in the syslog of the primary (a01) itself. From the /var/adm/messages on newly elected primary:
Limits on the host:
|
| Participants: |
| Description |
|
We have a four-member replica set that we were upgrading from 2.6.8 to 3.0.4, and we had to roll back because of reliability issues. After the primary switch, the former primary would be unresponsive for 5-10 minutes and would not always recover. In one case MongoDB had to be restarted.
Before event -> a01 (p), a02 (s), m01 (s), m02 (s)
After event -> a02 (s), m01 (s), m02 (p)
NOTE: one of the replica set members, a02, is still on 2.6.8; the rest are on 3.0.4.
This was quite reproducible and occurred several times on different primaries. We have been running 2.6.8 for over a year with no such issues. We are using the MMAP storage engine; our goal is to migrate to WiredTiger, and this upgrade was the first step in that migration. We are considering upgrading to 3.0.6 but wanted some insight into the issue before going down that path. |
| Comments |
| Comment by Arjun Taneja [ 22/Oct/15 ] |
|
Also, we used the same replica set hosts with 2.6.8. If you need more Thanks, |
| Comment by Arjun Taneja [ 22/Oct/15 ] |
|
If you follow the trail reported there. The page allocation failure was on We can try 3.0.6 and give you more information if needed. Which specific |
| Comment by Ramon Fernandez Marina [ 22/Oct/15 ] |
|
arjuntaneja, unfortunately there's not enough information to determine what the issue is. The "page allocation failure" message indicates that the OS failed to allocate memory; it could be legitimate (low-memory conditions) or it could be a Linux kernel bug (e.g., https://access.redhat.com/solutions/90883). When the OS can't allocate memory, it is entirely possible that some subsystems (e.g., networking) and/or processes (e.g., mongod) become unresponsive, in which case an election could be triggered. In that respect it seems that the replica set worked as expected and elected a new primary. The SERVER project is for reporting bugs or feature suggestions for the MongoDB server, and since there's no evidence to indicate a bug I'm going to close this ticket. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this, involving more discussion, would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources. Regards, |
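
To correlate elections with the OS-level failures described above, one option is to scan each host's system log for the "page allocation failure" message and compare the timestamps of the hits with the election times. The following is a minimal sketch, not an official diagnostic tool. The function names are hypothetical, and the log path is left to the caller (the report mentions /var/adm/messages on the newly elected primary).

```python
import re
from typing import Iterable, List

# The kernel message quoted in the comment above; matched case-insensitively.
PAGE_ALLOC_FAILURE = re.compile(r"page allocation failure", re.IGNORECASE)


def find_page_allocation_failures(lines: Iterable[str]) -> List[str]:
    """Return the log lines that mention a page allocation failure."""
    return [
        line.rstrip("\n")
        for line in lines
        if PAGE_ALLOC_FAILURE.search(line)
    ]


def scan_log_file(path: str) -> List[str]:
    """Scan one log file (hypothetical helper; the path is chosen by the caller)."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as log:
        return find_page_allocation_failures(log)
```

Calling `scan_log_file` on each replica set member's system log gives the matching lines for that host, which can then be lined up against the times at which primaries changed.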