[SERVER-21049] Primary experiences CPU spike under load leading to automatic primary switch Created: 21/Oct/15  Updated: 22/Oct/15  Resolved: 22/Oct/15

Status: Closed
Project: Core Server
Component/s: MMAPv1
Affects Version/s: 3.0.4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Arjun Taneja Assignee: Unassigned
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Mongodb Host CPU busy.png    
Operating System: ALL
Steps To Reproduce:

Oddly there was nothing in the syslog of the primary (a01) itself.

From the /var/adm/messages on newly elected primary:

Oct 15 10:59:24 m02 kernel: kswapd0: page allocation failure. order:1, mode:0x20
Oct 15 10:59:24 m02 kernel: Pid: 99, comm: kswapd0 Not tainted 2.6.32-504.23.4.el6.x86_64 #1
Oct 15 10:59:24 m02 kernel: Call Trace:
Oct 15 10:59:24 m02 kernel: <IRQ>  [<ffffffff811345aa>] ? __alloc_pages_nodemask+0x74a/0x8d0
Oct 15 10:59:24 m02 kernel: [<ffffffff811735c2>] ? kmem_getpages+0x62/0x170
Oct 15 10:59:24 m02 kernel: [<ffffffff811741da>] ? fallback_alloc+0x1ba/0x270
Oct 15 10:59:24 m02 kernel: [<ffffffff81173c2f>] ? cache_grow+0x2cf/0x320
Oct 15 10:59:24 m02 kernel: [<ffffffff81173f59>] ? ____cache_alloc_node+0x99/0x160
Oct 15 10:59:24 m02 kernel: [<ffffffff81174d63>] ? kmem_cache_alloc+0x123/0x190
Oct 15 10:59:24 spushdb-m02 kernel: [<ffffffff8144cd38>] ? sk_prot_alloc+0x48/0x1c0
Oct 15 10:59:24 m02 kernel: [<ffffffff8144df62>] ? sk_clone+0x22/0x2e0
Oct 15 10:59:24 m02 kernel: [<ffffffff814a2176>] ? inet_csk_clone+0x16/0xd0
Oct 15 10:59:24 m02 kernel: [<ffffffff814bbca3>] ? tcp_create_openreq_child+0x23/0x470
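
For anyone collecting the same evidence from the other members, here is a minimal sketch that scans syslog lines for the "page allocation failure" header shown above and pulls out the host, process, order and mode. The function name find_allocation_failures and the 4 KiB page size are assumptions made for illustration; the line format is taken only from the excerpt above and may differ on other kernels or syslog configurations.

```python
import re

# Matches the first line of a kernel page allocation failure report as it
# appears in the excerpt above (syslog timestamp, host, "kernel:", process).
FAILURE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<comm>\S+): page allocation failure\. "
    r"order:(?P<order>\d+), mode:(?P<mode>0x[0-9a-fA-F]+)"
)

PAGE_SIZE = 4096  # assumption: 4 KiB pages, the usual x86_64 default


def find_allocation_failures(lines):
    """Return one dict per 'page allocation failure' line found in lines."""
    failures = []
    for line in lines:
        m = FAILURE_RE.match(line)
        if m is None:
            continue  # call-trace lines and unrelated messages are skipped
        order = int(m.group("order"))
        failures.append({
            "timestamp": m.group("ts"),
            "host": m.group("host"),
            "process": m.group("comm"),
            "order": order,
            "mode": int(m.group("mode"), 16),
            # An order-n request asks for 2**n physically contiguous pages.
            "bytes_requested": (2 ** order) * PAGE_SIZE,
        })
    return failures


if __name__ == "__main__":
    sample = [
        "Oct 15 10:59:24 m02 kernel: kswapd0: page allocation failure. order:1, mode:0x20",
        "Oct 15 10:59:24 m02 kernel: Call Trace:",
    ]
    for failure in find_allocation_failures(sample):
        print(failure)
```

Running this over the messages file of every member around the time of the election would show whether the failure was limited to one host.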

Limits on the host:

cat limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            10485760             unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             65536                65536                processes
Max open files            65536                65536                files
Max locked memory         65536                65536                bytes
Max address space         unlimited            unlimited            bytes
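
To compare these limits across replica set members, the table can be parsed into a dictionary. The sketch below assumes the text comes from a /proc/<pid>/limits style file (the "cat limits" above suggests this, but it is an assumption) and that columns are separated by two or more spaces; parse_limits is a hypothetical helper name.

```python
import re


def parse_limits(text):
    """Parse limits-table text into {name: (soft, hard, units)}.

    'unlimited' is returned as None; numeric values are returned as int.
    """
    def to_value(field):
        return None if field == "unlimited" else int(field)

    limits = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("Limit"):
            continue  # skip blank lines and the header row
        # Limit names contain single spaces, so split on runs of 2+ spaces.
        parts = re.split(r"\s{2,}", line)
        if len(parts) < 3:
            continue  # not a limits row
        name, soft, hard = parts[0], parts[1], parts[2]
        units = parts[3] if len(parts) > 3 else ""
        limits[name] = (to_value(soft), to_value(hard), units)
    return limits


if __name__ == "__main__":
    sample = (
        "Limit                     Soft Limit           Hard Limit           Units\n"
        "Max stack size            10485760             unlimited            bytes\n"
        "Max open files            65536                65536                files\n"
        "Max locked memory         65536                65536                bytes\n"
    )
    for name, values in parse_limits(sample).items():
        print(name, values)
```

Diffing the resulting dictionaries from each host is a quick way to confirm that all members run with the same settings.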

Participants:

 Description   

We have a four-member replica set that we were in the process of upgrading from 2.6.8 to 3.0.4, and we had to roll back due to reliability issues.

After the switch, the former primary would be unresponsive for 5-10 minutes and did not always recover. In one case MongoDB had to be restarted.

Before event -> a01 (p), a02 (s), m01 (s), m02 (s)

After event -> a02 (s), m01 (s), m02 (p)

NOTE: one of the replica set members, a02, is still on 2.6.8; the rest are on 3.0.4.

This was quite reproducible and occurred several times on different primaries. We have been running 2.6.8 for over a year with no such issues.

We are using the MMAPv1 storage engine. Our goal was to complete the migration to WiredTiger, but this upgrade was the first step in that migration.

We are considering upgrading to 3.0.6 but wanted to get some insight into the issue before we go down that path.



 Comments   
Comment by Arjun Taneja [ 22/Oct/15 ]

Also, we used the same replica set hosts with 2.6.8. If you need more
information we'll be happy to provide it.

Thanks,
-Arjun.

Comment by Arjun Taneja [ 22/Oct/15 ]

If you follow the trail reported there, the page allocation failure was on
the secondary, not on the primary that got switched out. Also, the primary
never recovered, which would warrant an investigation at the very least.

We can try 3.0.6 and give you more information if needed. Which specific
logs would help with this?

Comment by Ramon Fernandez Marina [ 22/Oct/15 ]

arjuntaneja, unfortunately there's not enough information to determine what the issue is. The "page allocation failure" message indicates an OS failure to allocate memory, and it could be legitimate (low memory conditions) or it could be a Linux kernel bug (e.g., https://access.redhat.com/solutions/90883).

When the OS can't allocate memory, it is entirely possible that some subsystems (e.g., networking) and/or processes (e.g., mongod) could become unresponsive, in which case an election could be triggered. In that respect it seems that the replica set worked as expected and elected a new primary.

The SERVER project is for reporting bugs or feature suggestions for the MongoDB server, and since there's no evidence to indicate a bug I'm going to close this ticket. For MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this involving more discussion would be best posted on the mongodb-user group. See also our Technical Support page for additional support resources.

Regards,
Ramón.

Generated at Thu Feb 08 03:56:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.