Core Server / SERVER-17067

Burst of highly contended updates causes mongod with oplog to be unresponsive

    • Type: Bug
    • Resolution: Done
    • Priority: Critical - P2
    • Fix Version/s: 3.0.0-rc8
    • Affects Version/s: 3.0.0-rc6
    • Component/s: WiredTiger
    • Labels: None
    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
      Steps to Reproduce:

      1. Start mongod with an oplog and run rs.initiate() in the shell.
      2. Run the attached IncFewSmallDoc.js test.
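
      A minimal sketch of those two steps, assuming a local single-node deployment (the dbpath, port, and replica set name below are illustrative, not taken from the attachment):

      // Hypothetical single-node setup; dbpath, port, and set name are illustrative.
      // 1. Start mongod with WiredTiger and a replica set oplog, e.g.:
      //      mongod --storageEngine wiredTiger --replSet rs0 --dbpath /data/rs0 --port 27017
      // 2. From a mongo shell connected to that node, initiate the single-node replica set:
      rs.initiate();         // default config: this node becomes the only member
      rs.status().myState;   // should report 1 (PRIMARY) once initiation completes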

      Ran into this issue while running a mongo-perf test against WiredTiger. In a collection with 100 documents and a high number of concurrent threads updating them (seen with thread counts > 80), throughput would drop significantly after 10 to 20 seconds. After this occurred, commands like "show dbs" in the shell would not return.
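
      A minimal sketch of that kind of workload, in the spirit of the attached IncFewSmallDoc.js (the collection name, thread count, and duration are illustrative assumptions, not the exact values from the attachment):

      var coll = db.getSiblingDB("test").incFewSmallDoc;    // hypothetical namespace
      coll.drop();
      for (var i = 0; i < 100; i++) {
          coll.insert({_id: i, count: 0});                  // 100 small documents
      }

      // benchRun fans out many parallel clients, each repeatedly $inc-ing a randomly
      // chosen one of the 100 documents -- a burst of highly contended updates.
      var res = benchRun({
          ops: [{
              ns: coll.getFullName(),
              op: "update",
              query: {_id: {"#RAND_INT": [0, 100]}},
              update: {$inc: {count: 1}}
          }],
          parallel: 100,       // problem reported with thread counts > 80
          seconds: 60,
          host: db.getMongo().host
      });
      printjson(res);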

      Additional observations so far:

      • Saw this in a single-node replica set. Without the oplog, did not see this issue.
      • Only saw this in RC6. Did not see this in RC5.
      • Have been able to reproduce this consistently on a couple of different machines.

      mongostat output when this is happening. Notice how the throughput drop corresponds to the very suspicious change in the "% dirty" and "% used" columns; it almost looks as if some kind of overflow/underflow has occurred (see the rough arithmetic sketch after the output below).

      insert query update delete getmore command % dirty % used flushes  vsize   res qr|qw ar|aw netIn netOut conn     set repl     time
          *0    *0     *0     *0       0     1|0     0.0    0.0       0 626.0M 46.0M   0|0   0|0   79b    16k    2 cap1442  PRI 04:07:26
         100    *0  15019     *0       0   114|0     0.0    0.0       0 723.0M 90.0M   0|0  0|75    2m    58k   83 cap1442  PRI 04:07:27
          *0    *0  30118     *0       0   303|0     0.0    0.0       0 740.0M 107.0M   0|0  0|76    4m    60k   83 cap1442  PRI 04:07:28
          *0    *0  33133     *0       0   328|0     0.0    0.0       0 752.0M 119.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:29
          *0    *0  33137     *0       0   334|0     0.0    0.0       0 761.0M 128.0M   0|0  0|71    5m    64k   83 cap1442  PRI 04:07:30
          *0    *0  28873     *0       0   286|0     0.0    0.0       0 767.0M 133.0M   0|0  0|70    4m    57k   83 cap1442  PRI 04:07:31
          *0    *0  31345     *0       0   321|0     0.0    0.0       0 776.0M 143.0M   0|0  0|77    4m    62k   83 cap1442  PRI 04:07:32
          *0    *0  31867     *0       0   318|0     0.0    0.0       0 785.0M 151.0M   0|0  0|64    4m    62k   83 cap1442  PRI 04:07:33
          *0    *0  33964     *0       0   345|0     0.0    0.0       0 795.0M 162.0M   0|0  0|76    5m    66k   83 cap1442  PRI 04:07:34
          *0    *0  32205     *0       0   323|0     0.1    0.1       0 799.0M 166.0M   0|0  0|80    4m    62k   83 cap1442  PRI 04:07:35
      insert query update delete getmore command % dirty % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
          *0    *0  32151     *0       0   325|0     0.1    0.1       0 806.0M 172.0M   0|0  0|70    4m    63k   83 cap1442  PRI 04:07:36
          *0    *0  30149     *0       0   301|0     0.0    0.0       0 810.0M 176.0M   0|0  0|62    4m    59k   83 cap1442  PRI 04:07:37
          *0    *0  33550     *0       0   338|0     0.0    0.0       0 810.0M 178.0M   0|0  0|69    5m    65k   83 cap1442  PRI 04:07:38
          *0    *0  33399     *0       0   332|0     0.0    0.0       0 811.0M 178.0M   0|0  0|53    5m    64k   83 cap1442  PRI 04:07:39
          *0    *0  30400     *0       0   307|0     0.1    0.1       0 811.0M 178.0M   0|0  0|71    4m    60k   83 cap1442  PRI 04:07:40
          *0    *0  31555     *0       0   316|0     0.0    0.0       0 813.0M 181.0M   0|0  0|69    4m    62k   83 cap1442  PRI 04:07:41
          *0    *0  30165     *0       0   309|0     0.0    0.0       0 820.0M 186.0M   0|0  0|66    4m    60k   83 cap1442  PRI 04:07:42
          *0    *0  33189     *0       0   326|0     0.1    0.1       0 829.0M 196.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:43
          *0    *0  32086     *0       0   324|0     0.1    0.1       0 841.0M 209.0M   0|0  0|65    4m    63k   83 cap1442  PRI 04:07:44
          *0    *0   2503     *0       0    22|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  347k    19k   83 cap1442  PRI 04:07:45
      insert query update delete getmore command       % dirty        % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
          *0    *0    948     *0       0    11|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  132k    18k   83 cap1442  PRI 04:07:46
          *0    *0    878     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  122k    18k   83 cap1442  PRI 04:07:47
          *0    *0    987     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  137k    18k   83 cap1442  PRI 04:07:48
          *0    *0    626     *0       0     7|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   87k    17k   83 cap1442  PRI 04:07:49
          *0    *0    356     *0       0     6|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   50k    17k   83 cap1442  PRI 04:07:50
          *0    *0     61     *0       0     1|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:51
          *0    *0    306     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   43k    17k   83 cap1442  PRI 04:07:52
          *0    *0    995     *0       0    10|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  138k    17k   83 cap1442  PRI 04:07:53
          *0    *0    415     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   58k    17k   83 cap1442  PRI 04:07:54
          *0    *0     56     *0       0     3|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:55
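
      As a rough back-of-the-envelope check on that suspicion (a sketch only: the ~31 GB cache size is an assumption based on the ~61 GB of RAM visible in the top output below and WiredTiger's default of roughly half of physical memory, and it assumes mongostat derives % dirty as tracked dirty bytes over the configured cache maximum):

      var pctDirty   = 29620464110.3;            // value reported by mongostat above
      var cacheBytes = 31 * 1024 * 1024 * 1024;  // assumed WiredTiger cache maximum (~31 GB)

      // If % dirty == tracked dirty bytes / cache maximum * 100, the implied
      // dirty byte count is:
      var impliedDirtyBytes = pctDirty / 100 * cacheBytes;
      print(impliedDirtyBytes.toExponential(2)); // ~9.9e+18 bytes, i.e. millions of terabytes

      // That is in the neighborhood of 2^63 (~9.2e+18), which is what an unsigned
      // 64-bit byte counter looks like after being decremented below zero --
      // consistent with the overflow/underflow suspicion above.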
      

      "top" output shows a correlation between the number of connections and the amount of CPU used by mongod. At the beginning of the tests when we pumped out high throughput, CPU utilization was very high. After the system got caught in this funny state with very low throughput, CPU start dropping quickly. Eventually, CPU utilization by mongod seems to settle down around 200% when the number of connections stayed at 4. Somehow we seem to have 3 connections that are stuck (in successful tests, the connection count drops to 1).

      Tasks: 555 total,   1 running, 554 sleeping,   0 stopped,   0 zombie
      %Cpu(s):  6.3 us,  0.0 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
      KiB Mem:  61406272 total,  1460576 used, 59945696 free,     6172 buffers
      KiB Swap:        0 total,        0 used,        0 free.   529896 cached Mem
      
         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                             
       59083 ec2-user  20   0  836692 224424   6776 S 200.5  0.4  77:13.04 mongod                                                                                              
       59696 ec2-user  20   0  725296  15948   8280 S   0.7  0.0   0:00.24 mongo                                                                                               
       60570 ec2-user  20   0  123948   2056   1160 R   0.3  0.0   0:00.05 top        
      

      It is very repeatable, and the repro steps and scripts are attached.

        Attachments:
        1. cap-1442.log (2.22 MB), uploaded by Chung-yen Chang
        2. IncFewSmallDoc.js (1 kB), uploaded by Chung-yen Chang

            Assignee: Susan LoVerso (sue.loverso@mongodb.com)
            Reporter: Chung-yen Chang (chung-yen.chang)
            Votes: 0
            Watchers: 10
