Core Server / SERVER-21049

Primary experiences CPU spike under load leading to automatic primary switch

Details

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • None
    • 3.0.4
    • MMAPv1
    • None
    • ALL

      Oddly there was nothing in the syslog of the primary (a01) itself.

      From /var/adm/messages on the newly elected primary (m02):

      Oct 15 10:59:24 m02 kernel: kswapd0: page allocation failure. order:1, mode:0x20
      Oct 15 10:59:24 m02 kernel: Pid: 99, comm: kswapd0 Not tainted 2.6.32-504.23.4.el6.x86_64 #1
      Oct 15 10:59:24 m02 kernel: Call Trace:
      Oct 15 10:59:24 m02 kernel: <IRQ>  [<ffffffff811345aa>] ? __alloc_pages_nodemask+0x74a/0x8d0
      Oct 15 10:59:24 m02 kernel: [<ffffffff811735c2>] ? kmem_getpages+0x62/0x170
      Oct 15 10:59:24 m02 kernel: [<ffffffff811741da>] ? fallback_alloc+0x1ba/0x270
      Oct 15 10:59:24 m02 kernel: [<ffffffff81173c2f>] ? cache_grow+0x2cf/0x320
      Oct 15 10:59:24 m02 kernel: [<ffffffff81173f59>] ? ____cache_alloc_node+0x99/0x160
      Oct 15 10:59:24 m02 kernel: [<ffffffff81174d63>] ? kmem_cache_alloc+0x123/0x190
      Oct 15 10:59:24 spushdb-m02 kernel: [<ffffffff8144cd38>] ? sk_prot_alloc+0x48/0x1c0
      Oct 15 10:59:24 m02 kernel: [<ffffffff8144df62>] ? sk_clone+0x22/0x2e0
      Oct 15 10:59:24 m02 kernel: [<ffffffff814a2176>] ? inet_csk_clone+0x16/0xd0
      Oct 15 10:59:24 m02 kernel: [<ffffffff814bbca3>] ? tcp_create_openreq_child+0x23/0x470
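
To check how often this failure recurs around each primary switch, the syslog can be scanned for the same message. The following is a minimal sketch, not part of the original report: the function name `find_allocation_failures` and the regular expression are illustrative, and the pattern assumes the syslog line layout shown in the excerpt above.

```python
import re

# Matches kernel lines shaped like the one in the excerpt, e.g.
# "Oct 15 10:59:24 m02 kernel: kswapd0: page allocation failure. order:1, mode:0x20"
# The layout is an assumption based on that single excerpt.
FAILURE_RE = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d+\s+[\d:]{8})\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<comm>[^:]+):\s+page allocation failure\.\s+"
    r"order:(?P<order>\d+),\s+mode:(?P<mode>0x[0-9a-fA-F]+)"
)


def find_allocation_failures(lines):
    """Return one dict per 'page allocation failure' line found in `lines`."""
    events = []
    for line in lines:
        match = FAILURE_RE.match(line.strip())
        if match is None:
            continue
        order = int(match["order"])
        events.append({
            "timestamp": match["timestamp"],
            "host": match["host"],
            "comm": match["comm"],
            "order": order,
            # An order-N request asks for 2**N physically contiguous pages.
            "contiguous_pages": 2 ** order,
            "mode": int(match["mode"], 16),
        })
    return events
```

Feeding it the lines of the messages file (for example `find_allocation_failures(open(path))`, where `path` is whichever log file applies on the host) gives a list that can be compared against the election timestamps.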
      

      Limits on the host:

      cat limits
      Limit                     Soft Limit           Hard Limit           Units
      Max cpu time              unlimited            unlimited            seconds
      Max file size             unlimited            unlimited            bytes
      Max data size             unlimited            unlimited            bytes
      Max stack size            10485760             unlimited            bytes
      Max core file size        0                    unlimited            bytes
      Max resident set          unlimited            unlimited            bytes
      Max processes             65536                65536                processes
      Max open files            65536                65536                files
      Max locked memory         65536                65536                bytes
      Max address space         unlimited            unlimited            bytes
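
For comparing these limits across the replica set members, the table can be parsed into a dictionary. This is a rough sketch under the assumption that the output above comes from a Linux `/proc/<pid>/limits` file, where columns are separated by runs of two or more spaces; `parse_limits` is a hypothetical helper name, not an existing tool.

```python
import re


def _to_value(field):
    """Convert a limit field to int, or None when it reads 'unlimited'."""
    return None if field == "unlimited" else int(field)


def parse_limits(text):
    """Parse limits output (as shown above) into {name: {soft, hard, units}}."""
    limits = {}
    for line in text.splitlines():
        # Columns are padded with spaces; limit names contain single spaces only.
        fields = re.split(r"\s{2,}", line.strip())
        if len(fields) < 3 or fields[0] == "Limit":
            continue  # skip the header, blank lines, and the 'cat limits' command
        units = fields[3] if len(fields) > 3 else ""
        limits[fields[0]] = {
            "soft": _to_value(fields[1]),
            "hard": _to_value(fields[2]),
            "units": units,
        }
    return limits
```

Running `parse_limits` on the output from each host and diffing the resulting dictionaries is one way to confirm that all members run with the same settings.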
      


    Description

      We have a four-member replica set that we were in the process of upgrading from 2.6.8 to 3.0.4, and we had to roll back due to reliability issues.

      After the switch, the former primary would be unresponsive for 5-10 minutes and did not always recover. In one case MongoDB had to be restarted.

      Before event -> a01 (p), a02 (s), m01 (s), m02 (s)

      After event -> a02 (s), m01 (s), m02 (p)
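
The before/after comparison above can also be derived from two `replSetGetStatus` documents captured around the event. The sketch below only relies on the `members[].name` and `members[].stateStr` fields of that output. The helper names are hypothetical, and the short host names in the usage example are taken from the listing above rather than from real `host:port` member names.

```python
def member_states(status):
    """Map member name -> stateStr from a replSetGetStatus-style document."""
    return {m["name"]: m["stateStr"] for m in status.get("members", [])}


def find_primary(states):
    """Return the name of the member reporting PRIMARY, or None."""
    return next((name for name, state in states.items() if state == "PRIMARY"), None)


def describe_failover(before, after):
    """Summarise a primary switch between two status snapshots."""
    old, new = member_states(before), member_states(after)
    return {
        "old_primary": find_primary(old),
        "new_primary": find_primary(new),
        # Members present before the event but absent from the later snapshot.
        "missing_after": sorted(set(old) - set(new)),
    }


# Snapshots hand-built from the before/after listing in this report.
before = {"members": [
    {"name": "a01", "stateStr": "PRIMARY"},
    {"name": "a02", "stateStr": "SECONDARY"},
    {"name": "m01", "stateStr": "SECONDARY"},
    {"name": "m02", "stateStr": "SECONDARY"},
]}
after = {"members": [
    {"name": "a02", "stateStr": "SECONDARY"},
    {"name": "m01", "stateStr": "SECONDARY"},
    {"name": "m02", "stateStr": "PRIMARY"},
]}
summary = describe_failover(before, after)
```

In a live deployment an unreachable member normally still appears in the status output with a non-healthy state, so `missing_after` mainly reflects how the listing above was written.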

      NOTE: one of the replica set members, a02, is still on 2.6.8; the rest are on 3.0.4.

      This was quite reproducible and occurred several times on different primaries. We have been running 2.6.8 for over a year with no such issues.

      We are using the MMAPv1 storage engine. Our goal was to complete a migration to WiredTiger, and this upgrade was the first step in that migration.

      We are considering upgrading to 3.0.6 but wanted to get some insight into the issue before we go down that path.

          People

            Assignee: Unassigned
            Reporter: Arjun Taneja (arjuntaneja)