[SERVER-17067] Burst of highly contended updates causes mongod with oplog to be irresponsive Created: 27/Jan/15  Updated: 09/Feb/15  Resolved: 29/Jan/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.0.0-rc6
Fix Version/s: 3.0.0-rc8

Type: Bug Priority: Critical - P2
Reporter: Chung-yen Chang Assignee: Susan LoVerso
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: IncFewSmallDoc.js, cap-1442.log
Issue Links:
related to SERVER-17100 mongod hung running mongo-perf (Closed)
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. Start mongod with an oplog and run rs.initiate() in the shell (a command sketch follows below).
2. Run the attached IncFewSmallDoc.js test.
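
A minimal command sketch of these steps. The replica-set name cap1442 comes from the mongostat output below; the dbpath and the use of load() are illustrative assumptions, not taken from the ticket:

    // 1. Start a standalone mongod as a single-node replica set (oplog enabled).
    //    Flags other than --replSet are illustrative:
    //      mongod --replSet cap1442 --storageEngine wiredTiger --dbpath /data/cap1442
    // 2. From the mongo shell, initiate the replica set:
    rs.initiate();
    // 3. Load and run the attached workload:
    load("IncFewSmallDoc.js");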

Participants:

 Description   

Ran into this issue while running a mongo-perf test against WiredTiger. In a collection with 100 documents being updated by a high number of concurrent threads (seen with thread counts > 80), throughput would drop significantly after 10 to 20 seconds. After this occurred, commands like "show dbs" in the shell would not return.
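
For context, a hypothetical stand-in for this kind of workload, using the shell's benchRun(); this is not the contents of the attached IncFewSmallDoc.js, and all names and parameters here are illustrative:

    // Many shell threads issuing $inc updates against a small pool of
    // documents, creating heavy write contention.
    var coll = db.getSiblingDB("test").incFewSmallDoc;
    coll.drop();
    for (var i = 0; i < 100; i++) {
        coll.insert({_id: i, count: 0});   // 100 small documents, as in the test
    }
    var res = benchRun({
        host: db.getMongo().host,
        seconds: 30,
        parallel: 90,                      // >80 threads reproduces the stall
        ops: [{
            ns: coll.getFullName(),
            op: "update",
            query: {_id: {"#RAND_INT": [0, 100]}},
            update: {$inc: {count: 1}}
        }]
    });
    printjson(res);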

Additional observations so far:

  • Saw this in a single-node replica set; without an oplog, the issue did not appear.
  • Only saw this in RC6, not in RC5.
  • Have been able to reproduce this consistently on a couple of different machines.

mongostat output while this is happening. Note how the throughput drop corresponds to the very suspicious change in the "% dirty" and "% used" columns; it almost looks like some kind of overflow/underflow has occurred. (A sketch of how these columns are derived follows the output below.)

insert query update delete getmore command % dirty % used flushes  vsize   res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0     *0     *0       0     1|0     0.0    0.0       0 626.0M 46.0M   0|0   0|0   79b    16k    2 cap1442  PRI 04:07:26
   100    *0  15019     *0       0   114|0     0.0    0.0       0 723.0M 90.0M   0|0  0|75    2m    58k   83 cap1442  PRI 04:07:27
    *0    *0  30118     *0       0   303|0     0.0    0.0       0 740.0M 107.0M   0|0  0|76    4m    60k   83 cap1442  PRI 04:07:28
    *0    *0  33133     *0       0   328|0     0.0    0.0       0 752.0M 119.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:29
    *0    *0  33137     *0       0   334|0     0.0    0.0       0 761.0M 128.0M   0|0  0|71    5m    64k   83 cap1442  PRI 04:07:30
    *0    *0  28873     *0       0   286|0     0.0    0.0       0 767.0M 133.0M   0|0  0|70    4m    57k   83 cap1442  PRI 04:07:31
    *0    *0  31345     *0       0   321|0     0.0    0.0       0 776.0M 143.0M   0|0  0|77    4m    62k   83 cap1442  PRI 04:07:32
    *0    *0  31867     *0       0   318|0     0.0    0.0       0 785.0M 151.0M   0|0  0|64    4m    62k   83 cap1442  PRI 04:07:33
    *0    *0  33964     *0       0   345|0     0.0    0.0       0 795.0M 162.0M   0|0  0|76    5m    66k   83 cap1442  PRI 04:07:34
    *0    *0  32205     *0       0   323|0     0.1    0.1       0 799.0M 166.0M   0|0  0|80    4m    62k   83 cap1442  PRI 04:07:35
insert query update delete getmore command % dirty % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0  32151     *0       0   325|0     0.1    0.1       0 806.0M 172.0M   0|0  0|70    4m    63k   83 cap1442  PRI 04:07:36
    *0    *0  30149     *0       0   301|0     0.0    0.0       0 810.0M 176.0M   0|0  0|62    4m    59k   83 cap1442  PRI 04:07:37
    *0    *0  33550     *0       0   338|0     0.0    0.0       0 810.0M 178.0M   0|0  0|69    5m    65k   83 cap1442  PRI 04:07:38
    *0    *0  33399     *0       0   332|0     0.0    0.0       0 811.0M 178.0M   0|0  0|53    5m    64k   83 cap1442  PRI 04:07:39
    *0    *0  30400     *0       0   307|0     0.1    0.1       0 811.0M 178.0M   0|0  0|71    4m    60k   83 cap1442  PRI 04:07:40
    *0    *0  31555     *0       0   316|0     0.0    0.0       0 813.0M 181.0M   0|0  0|69    4m    62k   83 cap1442  PRI 04:07:41
    *0    *0  30165     *0       0   309|0     0.0    0.0       0 820.0M 186.0M   0|0  0|66    4m    60k   83 cap1442  PRI 04:07:42
    *0    *0  33189     *0       0   326|0     0.1    0.1       0 829.0M 196.0M   0|0  0|74    5m    63k   83 cap1442  PRI 04:07:43
    *0    *0  32086     *0       0   324|0     0.1    0.1       0 841.0M 209.0M   0|0  0|65    4m    63k   83 cap1442  PRI 04:07:44
    *0    *0   2503     *0       0    22|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  347k    19k   83 cap1442  PRI 04:07:45
insert query update delete getmore command       % dirty        % used flushes  vsize    res qr|qw ar|aw netIn netOut conn     set repl     time
    *0    *0    948     *0       0    11|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  132k    18k   83 cap1442  PRI 04:07:46
    *0    *0    878     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  122k    18k   83 cap1442  PRI 04:07:47
    *0    *0    987     *0       0    12|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  137k    18k   83 cap1442  PRI 04:07:48
    *0    *0    626     *0       0     7|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   87k    17k   83 cap1442  PRI 04:07:49
    *0    *0    356     *0       0     6|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   50k    17k   83 cap1442  PRI 04:07:50
    *0    *0     61     *0       0     1|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:51
    *0    *0    306     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   43k    17k   83 cap1442  PRI 04:07:52
    *0    *0    995     *0       0    10|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80  138k    17k   83 cap1442  PRI 04:07:53
    *0    *0    415     *0       0     5|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80   58k    17k   83 cap1442  PRI 04:07:54
    *0    *0     56     *0       0     3|0 29620464110.3 29620464110.3       0 866.0M 210.0M   0|0  0|80    8k    16k   83 cap1442  PRI 04:07:55
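
A sketch of how the % dirty and % used columns are computed, under the assumption that mongostat derives them from the WiredTiger cache statistics in serverStatus; an unsigned byte counter wrapping below zero would surface as exactly the kind of absurd 29620464110.3% values above:

    // Field names are real WiredTiger cache statistics in serverStatus:
    var cache = db.serverStatus().wiredTiger.cache;
    var max = cache["maximum bytes configured"];
    var pctDirty = 100 * cache["tracked dirty bytes in the cache"] / max;
    var pctUsed = 100 * cache["bytes currently in the cache"] / max;
    // An underflowed dirty-bytes counter would yield a huge value here:
    print("% dirty: " + pctDirty.toFixed(1) + "  % used: " + pctUsed.toFixed(1));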

"top" output shows a correlation between the number of connections and the amount of CPU used by mongod. At the beginning of the tests when we pumped out high throughput, CPU utilization was very high. After the system got caught in this funny state with very low throughput, CPU start dropping quickly. Eventually, CPU utilization by mongod seems to settle down around 200% when the number of connections stayed at 4. Somehow we seem to have 3 connections that are stuck (in successful tests, the connection count drops to 1).

Tasks: 555 total,   1 running, 554 sleeping,   0 stopped,   0 zombie
%Cpu(s):  6.3 us,  0.0 sy,  0.0 ni, 93.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:  61406272 total,  1460576 used, 59945696 free,     6172 buffers
KiB Swap:        0 total,        0 used,        0 free.   529896 cached Mem
 
   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                             
 59083 ec2-user  20   0  836692 224424   6776 S 200.5  0.4  77:13.04 mongod                                                                                              
 59696 ec2-user  20   0  725296  15948   8280 S   0.7  0.0   0:00.24 mongo                                                                                               
 60570 ec2-user  20   0  123948   2056   1160 R   0.3  0.0   0:00.05 top        
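
A diagnostic sketch for identifying the operations behind the stuck connections, using db.currentOp(); the 10-second threshold is an arbitrary illustration:

    // List operations that have been active for more than 10 seconds:
    var ops = db.currentOp({active: true, secs_running: {$gt: 10}}).inprog;
    ops.forEach(function(op) {
        print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s");
    });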

This is very repeatable; I will attach the repro steps and scripts.



 Comments   
Comment by Daniel Pasette (Inactive) [ 29/Jan/15 ]

Resolved with https://github.com/wiredtiger/wiredtiger/pull/1616
