[SERVER-21691] Insert stalls during dirty writeback Created: 30/Nov/15  Updated: 30/Mar/16  Resolved: 01/Dec/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 3.2.0-rc4
Fix Version/s: 3.2.0-rc5

Type: Bug Priority: Blocker - P1
Reporter: Martin Bligh Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: WTplaybook
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File benchRun     File metrics.2015-11-30T14-44-15Z-00000     PNG File stalls.png     PNG File stalls.png    
Issue Links:
Depends
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
is depended on by SERVER-21686 WiredTiger changes for 3.2.0-rc5 Closed
Related
is related to SERVER-21752 slow2_wt fails by exhausting host mac... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Standalone mongodb instance, 24CPU, 32GB RAM
benchRun 32 false true 100000
(see attached script)

Participants:

 Description   

With the resolution of SERVER-21652, things are definitely much improved. However, I still see some heavy stalls during what appears to be dirty writeback. Maybe checkpoints, but not time-aligned as I'd expect?



 Comments   
Comment by Martin Bligh [ 07/Dec/15 ]

sue.loverso Only if I put journal and data on the same device, otherwise it looks OK

Comment by Susan LoVerso [ 07/Dec/15 ]

martin.bligh are you still seeing WT log stalls? Do you have data/stats from that? A large number of slot join races implies that we're waiting for a free slot. That means we may not be writing them out to the OS timely enough or we're stuck waiting for something.

Comment by Martin Bligh [ 01/Dec/15 ]

dan@10gen.com Nope. fixing ... looking much better now. Do we still want to address that, or close this back out?

Comment by Bruce Lucas (Inactive) [ 01/Dec/15 ]

Also, FTDC data please? Want to check if it's similar to something Ramon showed me yesterday, involving a large number of slot join races during the slowdowns.

Comment by Daniel Pasette (Inactive) [ 01/Dec/15 ]

Are. You running with journal on separate volume?

Comment by Martin Bligh [ 01/Dec/15 ]

If I restart with --nojournal, the problem goes away.

Comment by Martin Bligh [ 01/Dec/15 ]

-   5.05%  mongod  mongod               [.] __wt_row_modify                                        
   - __wt_row_modify                                                                               
      - 99.67% __wt_btcur_insert                                                                   
         - __curfile_insert                                                                        
            + 53.12% mongo::WiredTigerIndexUnique::_insert(__wt_cursor*, mongo::BSONObj const&, mon
            + 46.88% mongo::WiredTigerRecordStore::insertRecords(mongo::OperationContext*, std::vec
-   4.66%  mongod  [kernel.kallsyms]    [k] __schedule                                             
   - __schedule                                                                                    
      - 96.80% sys_sched_yield                                                                     
           system_call_fastpath                                                                    
           __sched_yield                                                                           
           __wt_log_write                                                                          
           __wt_txn_commit                                                                         
           __session_commit_transaction                                                            
           mongo::WiredTigerRecoveryUnit::_txnClose(bool)                                          
           mongo::WiredTigerRecoveryUnit::_commit()                                                
           mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bool, char 
           mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*
           mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, char con
           mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Me
           mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, m
           mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)        
           mongo::PortMessageServer::handleIncomingMsg(void*)                                      
           start_thread                                                                            
      + 3.00% system_call_fastpath                                                                 
-   4.42%  mongod  mongod               [.] __wt_log_slot_join                                     
   - __wt_log_slot_join                                                                            
      - 99.96% __wt_log_write                                                                      
           __wt_txn_commit                                                                         
           __session_commit_transaction                                                            
           mongo::WiredTigerRecoveryUnit::_txnClose(bool)                                          
           mongo::WiredTigerRecoveryUnit::_commit()                                                
           mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bool, char 
           mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, char const*
           mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, char con
           mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, mongo::Me
           mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbResponse&, m
           mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*)        
           mongo::PortMessageServer::handleIncomingMsg(void*)                                      
           start_thread                                                                            
-   4.41%  mongod  [kernel.kallsyms]    [k] put_prev_task_fair                                     
   - put_prev_task_fair                                                                            
      - 93.89% __schedule                                                                          
         - 99.98% sys_sched_yield                                                                  
              system_call_fastpath                                                                 
            - __sched_yield                                                                        
               - 99.98% __wt_log_write                                                             
                    __wt_txn_commit                                                                
                    __session_commit_transaction                                                   
                    mongo::WiredTigerRecoveryUnit::_txnClose(bool)                                 
                    mongo::WiredTigerRecoveryUnit::_commit()                                       
                    mongo::insertMultiVector(mongo::OperationContext*, mongo::OldClientContext&, bo
                    mongo::insertMulti(mongo::OperationContext*, mongo::OldClientContext&, bool, ch
                    mongo::_receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&,
                    mongo::receivedInsert(mongo::OperationContext*, mongo::NamespaceString const&, 
                    mongo::assembleResponse(mongo::OperationContext*, mongo::Message&, mongo::DbRes
                    mongo::MyMessageHandler::process(mongo::Message&, mongo::AbstractMessagingPort*
                    mongo::PortMessageServer::handleIncomingMsg(void*)                             
                    start_thread                                                                   
      + 6.09% sys_sched_yield                  

Comment by Martin Bligh [ 01/Dec/15 ]

I'm still getting large stalls

localhost:27017 1463356    *0     *0     *0       0   150|0    71.1   78.4       0  1.4G 1.2G   0|2  0|27  739m    33k   34       09:52:10
localhost:27017 1471816    *0     *0     *0       0   148|0    89.4   91.5       0  1.4G 1.2G   0|2  0|30  744m    33k   34       09:52:11
localhost:27017 1682940    *0     *0     *0       0   169|0    83.6   85.7       0  1.4G 1.2G   0|0  0|32  850m    35k   34       09:52:12
localhost:27017 1642508    *0     *0     *0       0   164|0    70.3   72.2       0  1.4G 1.2G   0|3  0|29  830m    34k   34       09:52:13
localhost:27017 1717792    *0     *0     *0       0   179|0    82.4   84.4       0  1.4G 1.2G   0|1  0|29  868m    35k   34       09:52:14
localhost:27017  173536    *0     *0     *0       0    11|0    14.4   16.3       0  1.4G 1.2G   0|0  0|32   88m    19k   34       09:52:15
localhost:27017      *0    *0     *0     *0       0     1|0    14.4   16.3       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:16
localhost:27017      *0    *0     *0     *0       0     1|0    14.4   16.3       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:17
localhost:27017      *0    *0     *0     *0       0     1|0    14.4   16.3       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:18
localhost:27017 1049744    *0     *0     *0       0   112|0    80.0   82.0       0  1.4G 1.2G   0|0  0|30  531m    29k   34       09:52:19
                 insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn time
localhost:27017 1597220    *0     *0     *0       0   160|0    90.0   91.9       0  1.4G 1.2G   0|2  0|29  807m    34k   34       09:52:20
localhost:27017 1528928    *0     *0     *0       0   155|0    68.1   78.9       0  1.4G 1.2G   0|0  0|31  772m    33k   34       09:52:21
localhost:27017 1539116    *0     *0     *0       0   155|0    66.8   68.2       0  1.4G 1.2G   0|0  0|26  778m    33k   34       09:52:22
localhost:27017 1719500    *0     *0     *0       0   174|0    87.4   89.7       0  1.4G 1.2G   0|0  0|32  869m    35k   34       09:52:23
localhost:27017      *0    *0     *0     *0       0     2|0    15.6   16.9       0  1.4G 1.2G   0|0  0|32  133b    19k   34       09:52:24
localhost:27017      *0    *0     *0     *0       0     1|0    15.6   16.9       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:25
localhost:27017      *0    *0     *0     *0       0     1|0    15.6   16.9       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:26
localhost:27017      *0    *0     *0     *0       0     1|0    15.6   16.9       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:27
localhost:27017  101056    *0     *0     *0       0     7|0    22.2   23.5       0  1.4G 1.2G   0|1  0|30   51m    19k   34       09:52:28
localhost:27017 1678336    *0     *0     *0       0   174|0    76.3   77.6       0  1.4G 1.2G   0|1  0|28  848m    35k   34       09:52:29
                 insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn time
localhost:27017  698680    *0     *0     *0       0    66|0    34.5   35.6       0  1.4G 1.2G   0|0  0|32  353m    25k   34       09:52:30
localhost:27017 1452456    *0     *0     *0       0   148|0    74.2   75.3       0  1.4G 1.2G   0|3  0|27  734m    33k   34       09:52:31
localhost:27017 1025588    *0     *0     *0       0   103|0    64.9   66.0       0  1.4G 1.2G   0|0  0|32  519m    28k   34       09:52:32
localhost:27017      *0    *0     *0     *0       0     1|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:33
localhost:27017      *0    *0     *0     *0       0     2|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32  133b    19k   34       09:52:34
localhost:27017      *0    *0     *0     *0       0     1|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:35
localhost:27017      *0    *0     *0     *0       0     1|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:36
localhost:27017      *0    *0     *0     *0       0     1|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:37
localhost:27017      *0    *0     *0     *0       0     1|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:38
localhost:27017      *0    *0     *0     *0       0     2|0    33.7   34.7       0  1.4G 1.2G   0|0  0|32  133b    19k   34       09:52:39
                 insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn time
localhost:27017 1325248    *0     *0     *0       0   137|0    70.1   71.1       0  1.4G 1.2G   0|0  0|27  669m    32k   34       09:52:40
localhost:27017 1694372    *0     *0     *0       0   170|0    82.2   83.1       0  1.4G 1.2G   0|0  0|31  856m    35k   34       09:52:41
localhost:27017 1666300    *0     *0     *0       0   170|0    94.2   95.1       0  1.4G 1.2G   0|1  0|30  842m    35k   34       09:52:42
localhost:27017 1634836    *0     *0     *0       0   159|0    82.1   83.0       0  1.4G 1.2G   0|1  0|29  826m    34k   34       09:52:43
localhost:27017 1672456    *0     *0     *0       0   169|0    73.2   74.1       0  1.4G 1.2G   0|1  0|29  845m    34k   34       09:52:44
localhost:27017 1675736    *0     *0     *0       0   170|0    82.0   82.8       0  1.4G 1.2G   0|0  0|30  847m    35k   34       09:52:45
localhost:27017 1576848    *0     *0     *0       0   159|0    91.8   93.6       0  1.4G 1.2G   0|0  0|30  797m    34k   34       09:52:46
localhost:27017 1585000    *0     *0     *0       0   161|0    86.1   87.0       0  1.4G 1.2G   0|2  0|28  801m    34k   34       09:52:47
localhost:27017 1510916    *0     *0     *0       0   153|0    69.8   70.7       0  1.4G 1.2G   0|1  0|29  764m    33k   34       09:52:48
localhost:27017 1675080    *0     *0     *0       0   168|0    81.1   82.8       0  1.4G 1.2G   0|0  0|29  847m    34k   34       09:52:49
                 insert query update delete getmore command % dirty % used flushes vsize  res qr|qw ar|aw netIn netOut conn time
localhost:27017 1631108    *0     *0     *0       0   163|0    86.8   87.7       0  1.4G 1.2G   0|0  0|32  824m    34k   34       09:52:50
localhost:27017 1512148    *0     *0     *0       0   152|0    72.7   73.5       0  1.4G 1.2G   0|0  0|31  764m    33k   34       09:52:51
localhost:27017 1550116    *0     *0     *0       0   157|0    75.7   76.5       0  1.4G 1.2G   0|0  0|32  784m    33k   34       09:52:52
localhost:27017 1594156    *0     *0     *0       0   161|0    89.0   89.7       0  1.4G 1.2G   0|0  0|32  805m    34k   34       09:52:53
localhost:27017      *0    *0     *0     *0       0     2|0    23.9   24.5       0  1.4G 1.2G   0|0  0|32  133b    19k   34       09:52:54
localhost:27017      *0    *0     *0     *0       0     1|0    23.9   24.5       0  1.4G 1.2G   0|0  0|32   79b    19k   34       09:52:55
localhost:27017 1189956    *0     *0     *0       0   120|0    86.6   87.3       0  1.4G 1.2G   0|0  0|30  601m    30k   34       09:52:56
localhost:27017 1636408    *0     *0     *0       0   165|0    92.7   93.3       0  1.4G 1.2G   0|1  0|30  827m    34k   34       09:52:57

Comment by Githook User [ 01/Dec/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: Import wiredtiger-wiredtiger-mongodb-3.2-rc4-56-g7a4f325.tar.gz from wiredtiger branch mongodb-3.2

ref: 8326df6..7a4f325

e731ef8 WT-2251 Free addresses when discarding deleted page references.
0e93d60 SERVER-21691 Avoid insert stalls
264ec21 WT-2249 Keep eviction stuck until cache usage is under 100%.
dca1411 WT-2250 Minor fix: use SET instead of INCRV for stat.
Branch: master
https://github.com/mongodb/mongo/commit/86e7b69a6c52c926d28a60d816faefa6db81eb96

Comment by Geert Bosch [ 01/Dec/15 ]

We're vastly better indeed now, and this is indeed an extreme workload, so sounds like a plan. I attached a ftdc plot FYI. The perf data was a two-second sample during one of the reduced throughput cases.

Comment by Michael Cahill (Inactive) [ 01/Dec/15 ]

geert.bosch, I suggest we open a 3.3 required ticket for the remaining stalls: my focus today was RC5.

In my testing I saw lower numbers overall than you but with less variation. perf output isn't very enlightening unless it can focus on those single seconds: there could be various factors in play, let's start with ftdc data and take it from there.

Comment by Geert Bosch [ 01/Dec/15 ]

This seems to take care of the worst stalls and longer periods of (near) inactivity, but there are still 1-2 second stalls:

1095228    *0     *0     *0       0   112|0       0 36.3 KB 36.0 KB Database:0.0%   0|0  0|29  553m    28k   35 23:29:54
 542124    *0     *0     *0       0    57|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|32  274m    23k   35 23:29:55
     *0    *0     *0     *0       0     1|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|32   79b    17k   35 23:29:56
 957720    *0     *0     *0       0    95|0       0 36.3 KB 36.0 KB Collection:0.0% 0|0  0|29  483m    26k   35 23:29:57
1107056    *0     *0     *0       0   112|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|31  559m    28k   35 23:29:58
1038656    *0     *0     *0       0   103|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|31  524m    27k   35 23:29:59
 insert query update delete getmore command flushes   vsize     res     locked db qr|qw ar|aw netIn netOut conn     time
1092360    *0     *0     *0       0   115|0       0 36.3 KB 36.0 KB Database:0.0%   0|0  0|30  552m    28k   35 23:30:00
1088476    *0     *0     *0       0   108|0       0 36.3 KB 36.0 KB Collection:0.0% 0|0  0|30  549m    28k   35 23:30:01
1100572    *0     *0     *0       0   114|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|31  556m    28k   35 23:30:02
1082972    *0     *0     *0       0   110|0       0 36.3 KB 36.0 KB Metadata:0.0%   0|0  0|30  547m    28k   35 23:30:03
 675520    *0     *0     *0       0    66|0       0 36.3 KB 36.1 KB Metadata:0.0%   0|0  0|32  341m    24k   35 23:30:04
 164604    *0     *0     *0       0    17|0       0 36.3 KB 36.1 KB Metadata:0.0%   0|0  0|32   83m    19k   35 23:30:05
    800    *0     *0     *0       0     1|0       0 36.3 KB 36.1 KB Metadata:0.0%   0|0  0|32  404k    17k   35 23:30:06
 974796    *0     *0     *0       0    99|0       0 36.3 KB 36.1 KB Database:0.0%   0|0  0|30  492m    27k   35 23:30:07
1091244    *0     *0     *0       0   110|0       0 36.3 KB 36.1 KB Collection:0.0% 0|0  0|30  551m    28k   35 23:30:08

perf top output:

-  18.28%           mongod  mongod               [.] __wt_row_modify                                                            
   - __wt_row_modify                                                                                                            
      - 99.75% __wt_btcur_insert                                                                                                
         - __curfile_insert                                                                                                     
            + 50.84% mongo::WiredTigerIndexUnique::_insert(__wt_cursor*, mongo::BSONObj const&, mongo::RecordId const&, bool)   
            + 49.16% mongo::WiredTigerRecordStore::insertRecords(mongo::OperationContext*, std::vector<mongo::Record, std::alloc
-   4.83%           mongod  mongod               [.] __wt_cache_eviction_check.part.7.constprop.16                              
   - __wt_cache_eviction_check.part.7.constprop.16                                                                              
      + 98.93% __curfile_insert                                                                                                 
      + 1.07% __wt_btcur_insert                                                                                                 
+   3.60%           mongod  mongod               [.] __wt_search_insert                                                         
+   3.22%           mongod  mongod               [.] __wt_row_search                                                            
+   2.42%           mongod  mongod               [.] __wt_hazard_set                                                            
+   1.90%           mongod  mongod               [.] __rec_txn_read                                                             
+   1.56%           mongod  libc-2.19.so         [.] __memcpy_sse2_unaligned                                                    

Comment by Githook User [ 01/Dec/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2349 from wiredtiger/SERVER-21691

SERVER-21691 Avoid insert stalls
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/0e93d60d0d86795b6d14187080bdcd5e8502c140

Comment by Githook User [ 01/Dec/15 ]

Author:

{u'username': u'agorrod', u'name': u'Alex Gorrod', u'email': u'alexander.gorrod@mongodb.com'}

Message: Merge pull request #2349 from wiredtiger/SERVER-21691

SERVER-21691 Avoid insert stalls
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/0e93d60d0d86795b6d14187080bdcd5e8502c140

Comment by Githook User [ 01/Dec/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-21691 Don't retry eviction of a page if transaction state hasn't changed.

This check was moved out of __wt_page_can_evict, which meant it only applied to LRU eviction. Move it back so that we don't repeatedly try forced eviction that has no chance of succeeding.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/cf62c714ced9cd518baf1f52c4295650b53e69d2

Comment by Githook User [ 01/Dec/15 ]

Author:

{u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

Message: SERVER-21691 Avoid blocking in-memory splits.

Only trylock the page's reconciliation lock (so in-memory splits aren't blocked by slow splits of siblings), don't give up as soon as a checkpoint starts.
Branch: develop
https://github.com/wiredtiger/wiredtiger/commit/8cb4ecce11762679a07f632c1ff3d7a7732ef023

Comment by Geert Bosch [ 30/Nov/15 ]

I can reproduce this, low throughput (often 0 .. 64 ops/sec) lasts for about 30 seconds before bouncing back up.

-  16.99%  mongod  [.] __rec_txn_read                                                                                                                                        
   - __rec_txn_read                                                                                                                                                          
      - 99.69% __rec_row_leaf_insert                                                                                                                                         
           __rec_row_leaf                                                                                                                                                    
         - __wt_reconcile                                                                                                                                                    
            - 74.17% __wt_cache_op                                                                                                                                           
                 __checkpoint_worker.isra.9                                                                                                                                  
                 __checkpoint_apply                                                                                                                                          
                 __txn_checkpoint                                                                                                                                            
                 __wt_txn_checkpoint                                                                                                                                         
                 __session_checkpoint                                                                                                                                        
                 __ckpt_server                                                                                                                                               
                 start_thread                                                                                                                                                
            + 25.83% __wt_evict                                                                                                                                              
-  14.65%  mongod  [.] __wt_row_modify                                                                                                                                       
   - __wt_row_modify                                                                                                                                                         
      + 89.73% __wt_btcur_insert                                                                                                                                             
      + 10.17% __split_multi_inmem                                                                                                                                           
-  11.79%  mongod  [.] __wt_cksum_hw                                                                                                                                         
   - __wt_cksum_hw                                                                                                                                                           
      + 86.39% __rec_split_write                                                                                                                                             
      + 12.83% __wt_block_write_off                                                                                                                                          
      + 0.78% __wt_log_write                                                                                                                                                 
-   9.35%  mongod  [.] snappy::internal::CompressFragment(char const*, unsigned long, char*, unsigned short*, int)                                                           
   - snappy::internal::CompressFragment(char const*, unsigned long, char*, unsigned short*, int)                                                                             
      + 99.95% snappy::Compress(snappy::Source*, snappy::Sink*)                                                                                                              
+   3.51%  mongod  [.] __wt_cache_eviction_check.part.7.constprop.16                                                                                                         
+   3.10%  mongod  [.] __wt_search_insert                                                                                                                                    
+   2.89%  mongod  [.] tc_free                                                                                                                                               
+   2.86%  mongod  [.] __wt_row_search                                                                                                                                       
+   2.57%  mongod  [.] __wt_tree_walk                                                                                                                                        
+   1.95%  mongod  [.] __free_skip_list                                                                                                                                      
+   1.93%  mongod  [.] __wt_free_update_list                                                                                                                                 
+   1.90%  mongod  [.] __wt_hazard_set                                                                                                                                       
+   1.48%  mongod  [.] __rec_cell_build_leaf_key                                                                                                                             
+   1.38%  mongod  [.] __wt_btcur_insert                                                                                                                                     
+   1.20%  mongod  [.] __rec_row_leaf_insert                                                                                                                                 
+   1.16%  mongod  [.] tcmalloc::CentralFreeList::FetchFromOneSpans(int, void**, void**)                                                                                     
+   1.15%  mongod  [.] tc_calloc                                                                                                                                             
+   1.15%  mongod  [.] __wt_struct_pack                                                                                                                                      
+   1.01%  mongod  [.] __wt_page_in_func                                                                                                                                     
+   1.01%  mongod  [.] __wt_struct_size                                                                                                                                      
+   0.99%  mongod  [.] tcmalloc::CentralFreeList::ReleaseToSpans(void*)                                                                                                      
+   0.71%  mongod  [.] tcmalloc::ThreadCache::ReleaseToCentralCache(tcmalloc::ThreadCache::FreeList*, unsigned long, int)                                                    
+   0.63%  mongod  [.] operator new(unsigned long)                                                                                                                           
+   0.52%  mongod  [.] __wt_calloc                                                                                                                                           
+   0.50%  mongod  [.] operator delete(void*)                                                                                                                                

Comment by Martin Bligh [ 30/Nov/15 ]

If we go back to -rc4, it drops out to zero much more frequently, but always bounces back. With current master it falls down and stays down.

Ramon can't repro this on his laptop, but Geert can on his desktop (no params needed for mongod, just default standalone).

Comment by Martin Bligh [ 30/Nov/15 ]

Footnote: with --wiredTigerCacheSizeGB=1 this doesn't happen.
Occasional drop for a second or two from 1.5M/s to 1M/s as dirty% drops, but much much better.

Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ]

Thanks Keith. Martin also reported he tried the patch for WT-2249 and it didn't help.

Comment by Keith Bostic (Inactive) [ 30/Nov/15 ]

I doubt it's connected to WT-2249, that happens when the cache is tiny and we can't find anything to evict (we do a lot of testing with tiny cache sizes to stress eviction).

Comment by Bruce Lucas (Inactive) [ 30/Nov/15 ]

  • initial stalls A-B look a lot like SERVER-21652
  • something rather different seems to be happening starting at C
  • looks like we are trying very hard to evict stuff (see pages walked for eviction), yet according to stats cache is not completely full?
Comment by Martin Bligh [ 30/Nov/15 ]

keith.bostic I think about 15s for the first few minutes, starting every 100s or so (IIRC, checkpoints should be aligned starts on 60s?) then it seems to all go haywire and be much less predictable. The time spacing in the mongostat data above is 1s intervals.

Comment by Martin Bligh [ 30/Nov/15 ]

Worse, after a few minutes, insert performance drops persistently to about 10% of what it was.

Perf top:

 20.50%  mongod                              [.] __wt_row_modify
 13.81%  mongod                              [.] __rec_txn_read
  4.55%  mongod                              [.] __wt_tree_walk
  3.55%  mongod                              [.] __wt_row_search
  3.44%  libc-2.19.so                        [.] __memcpy_sse2_unaligned

 54.42%  mongod                              [.] __wt_page_in_func
  9.32%  mongod                              [.] __wt_row_modify
  4.96%  mongod                              [.] __rec_txn_read
  4.50%  mongod                              [.] __wt_tree_walk
  1.86%  mongod                              [.] __wt_cache_eviction_worker

Callgraph:

-  24.08%  mongod  mongod               [.] __wt_row_modify                   
   - __wt_row_modify                                                          
      - 96.25% __split_multi_inmem                                            
           __wt_split_rewrite                                                 
         - __wt_evict                                                         
            - 99.45% __wt_page_in_func                                        
                 __wt_row_search                                              
               - __wt_btcur_insert                                            
                  + 100.00% __curfile_insert                                  
            + 0.55% __curfile_leave                                           
      + 3.64% __wt_btcur_insert                                               
-  14.14%  mongod  mongod               [.] __rec_txn_read                    
   - __rec_txn_read                                                           
      - 99.61% __rec_row_leaf_insert                                          
         - __wt_reconcile                                                     
            - 73.21% __wt_evict                                               
               - 99.34% __wt_page_in_func                                     
                    __wt_row_search                                           
                  - __wt_btcur_insert                                         
                     + 100.00% __curfile_insert                               
               + 0.66% __curfile_leave                                        
            + 26.79% __wt_cache_op                                            
-  12.24%  mongod  mongod               [.] __wt_tree_walk                    
   - __wt_tree_walk                                                           
      - 99.87% __evict_server                                                 
           start_thread                                                       
+   3.57%  mongod  mongod               [.] __wt_row_search                   
-   3.02%  mongod  libpthread-2.19.so   [.] pthread_mutex_unlock              
   - pthread_mutex_unlock                                                     
      - 95.78% __split_multi_inmem                                            
           __wt_split_rewrite                                                 
         - __wt_evict                                                         
            - 99.40% __wt_page_in_func                                        
                 __wt_row_search                                              
               - __wt_btcur_insert                                            
                  + 100.00% __curfile_insert                                  
            + 0.60% __curfile_leave                                           
      + 3.59% __wt_btcur_insert                                               
+   2.78%  mongod  mongod               [.] tcmalloc::CentralFreeList::RemoveR
-   2.78%  mongod  libpthread-2.19.so   [.] pthread_mutex_lock                
   - pthread_mutex_lock                                                       
      - 94.88% __split_multi_inmem                                            
           __wt_split_rewrite                                                 
         - __wt_evict                                                         
            - 99.36% __wt_page_in_func                                        
                 __wt_row_search                                              
               - __wt_btcur_insert                                            
                  + 100.00% __curfile_insert                                  
            + 0.64% __curfile_leave                                           
      + 4.46% __wt_btcur_insert              

Comment by Keith Bostic (Inactive) [ 30/Nov/15 ]

Yes, likely. How long are the stalls lasting, and they're not aligned with checkpoint?

Comment by Martin Bligh [ 30/Nov/15 ]

localhost:27017  1418372    *0     *0     *0       0   139|0    71.9   95.7       0 18.1G 17.9G   0|0  0|31  717m    32k   34       09:28:55
localhost:27017  1002588    *0     *0     *0       0   105|0    69.2   93.4       0 18.1G 17.9G   0|0  0|32  507m    28k   34       09:28:56
localhost:27017  1399548    *0     *0     *0       0   134|0    70.2   93.3       0 18.1G 17.9G   0|3  0|28  707m    31k   34       09:28:57
localhost:27017  1470036    *0     *0     *0       0   154|0    70.4   94.4       0 18.1G 17.9G   0|3  0|27  743m    33k   34       09:28:58
localhost:27017  1452236    *0     *0     *0       0   151|0    70.7   94.9       0 18.1G 17.9G   0|3  0|30  734m    33k   34       09:28:59
localhost:27017  1519036    *0     *0     *0       0   142|0    71.3   95.9       0 18.1G 17.9G   0|2  0|29  768m    32k   34       09:29:00
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017  1408736    *0     *0     *0       0   145|0    70.9   95.3       0 18.1G 17.9G   0|0  0|32  712m    32k   34       09:29:01
localhost:27017  1507972    *0     *0     *0       0   148|0    70.9   95.5       0 18.1G 17.9G   0|3  0|28  762m    33k   34       09:29:02
localhost:27017  1484356    *0     *0     *0       0   155|0    72.2   95.4       0 18.1G 17.9G   0|0  0|30  750m    33k   34       09:29:03
localhost:27017  1455028    *0     *0     *0       0   142|0    72.6   95.9       0 18.1G 17.9G   0|0  0|32  735m    32k   34       09:29:04
localhost:27017  1439364    *0     *0     *0       0   154|0    72.9   95.5       0 18.1G 17.9G   0|5  0|25  727m    33k   34       09:29:05
localhost:27017  1499044    *0     *0     *0       0   147|0    72.9   95.8       0 18.1G 17.9G   0|2  0|28  758m    33k   34       09:29:06
localhost:27017  1408292    *0     *0     *0       0   148|0    72.5   95.6       0 18.1G 17.9G   0|3  0|26  712m    33k   34       09:29:07
localhost:27017  1447300    *0     *0     *0       0   143|0    71.8   95.8       0 18.1G 17.9G   0|2  0|27  731m    32k   34       09:29:08
localhost:27017  1476508    *0     *0     *0       0   146|0    72.6   95.6       0 18.1G 17.9G   0|0  0|30  746m    32k   34       09:29:09
localhost:27017   269780    *0     *0     *0       0    32|0    67.5   89.6       0 18.1G 17.9G   0|0  0|32  136m    21k   34       09:29:10
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017       *0    *0     *0     *0       0     1|0    60.4   82.9       0 18.1G 17.9G   0|0  0|32   79b    19k   34       09:29:11
localhost:27017     2208    *0     *0     *0       0     2|0    53.6   76.9       0 18.1G 17.9G   0|0  0|32    1m    19k   34       09:29:12
localhost:27017   398484    *0     *0     *0       0    31|0    51.4   76.8       0 18.1G 17.9G   0|0  0|32  201m    21k   34       09:29:13
localhost:27017   150772    *0     *0     *0       0    16|0    49.2   77.4       0 18.1G 17.9G   0|0  0|32   76m    20k   34       09:29:14
localhost:27017      300    *0     *0     *0       0     1|0    45.6   77.4       0 18.1G 17.9G   0|0  0|32  152k    19k   34       09:29:15
localhost:27017   277428    *0     *0     *0       0    29|0    43.7   78.2       0 18.1G 17.9G   0|0  0|32  140m    21k   34       09:29:16
localhost:27017   233608    *0     *0     *0       0    25|0    41.7   79.0       0 18.1G 17.9G   0|0  0|32  118m    21k   34       09:29:17
localhost:27017      800    *0     *0     *0       0     2|0    38.2   79.0       0 18.1G 17.9G   0|0  0|32  404k    19k   34       09:29:18
localhost:27017   273300    *0     *0     *0       0    29|0    36.4   79.7       0 18.1G 17.9G   0|1  0|30  138m    21k   34       09:29:19
localhost:27017   254772    *0     *0     *0       0    26|0    32.2   75.2       0 18.1G 17.9G   0|0  0|32  129m    21k   34       09:29:20
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017       *0    *0     *0     *0       0     1|0    27.5   71.4       0 18.1G 17.9G   0|0  0|32   79b    19k   34       09:29:21
localhost:27017    62920    *0     *0     *0       0     7|0    24.7   71.5       0 18.1G 17.9G   0|0  0|32   32m    19k   34       09:29:22
localhost:27017   351744    *0     *0     *0       0    37|0    23.8   72.7       0 18.1G 17.9G   0|0  0|32  178m    22k   34       09:29:23
localhost:27017       *0    *0     *0     *0       0     1|0    20.8   72.6       0 18.1G 17.9G   0|0  0|32   79b    19k   34       09:29:24
localhost:27017    88000    *0     *0     *0       0    11|0    19.3   72.6       0 18.1G 17.9G   0|0  0|32   44m    20k   34       09:29:25
localhost:27017       64    *0     *0     *0       0     1|0    16.3   72.6       0 18.1G 17.9G   0|0  0|32   79b    19k   34       09:29:26
localhost:27017      664    *0     *0     *0       0     1|0    13.5   72.6       0 18.1G 17.9G   0|0  0|32  303k    19k   34       09:29:27
localhost:27017   464296    *0     *0     *0       0    49|0    13.3   74.0       0 18.1G 17.9G   0|0  0|32  234m    23k   34       09:29:28
localhost:27017       *0    *0     *0     *0       0     1|0    10.5   74.0       0 18.1G 17.9G   0|0  0|32   79b    19k   34       09:29:29
localhost:27017   645932    *0     *0     *0       0    65|0     9.1   73.1       0 18.1G 17.9G   0|1  0|30  327m    25k   34       09:29:30
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017  1444176    *0     *0     *0       0   156|0    11.1   75.0       0 18.1G 17.9G   0|1  0|29  729m    33k   34       09:29:31
localhost:27017  1454072    *0     *0     *0       0   147|0    16.0   80.0       0 18.1G 17.9G   0|1  0|28  735m    32k   34       09:29:32
localhost:27017  1526200    *0     *0     *0       0   151|0    21.1   82.1       0 18.1G 17.9G   0|2  0|29  771m    33k   34       09:29:33
localhost:27017  1498280    *0     *0     *0       0   146|0    26.1   79.1       0 18.1G 17.9G   0|1  0|26  757m    32k   34       09:29:34
localhost:27017  1486728    *0     *0     *0       0   158|0    31.2   79.8       0 18.1G 17.9G   0|0  0|29  751m    33k   34       09:29:35
localhost:27017  1585264    *0     *0     *0       0   153|0    36.6   77.6       0 18.1G 17.9G   0|0  0|30  801m    33k   34       09:29:36
localhost:27017  1590892    *0     *0     *0       0   161|0    41.9   80.0       0 18.1G 17.9G   0|1  0|31  804m    34k   34       09:29:37
localhost:27017  1551688    *0     *0     *0       0   156|0    47.0   78.7       0 18.1G 17.9G   0|2  0|30  784m    33k   34       09:29:38
localhost:27017  1642536    *0     *0     *0       0   162|0    52.6   80.0       0 18.1G 17.9G   0|0  0|31  830m    34k   34       09:29:39
localhost:27017  1566220    *0     *0     *0       0   164|0    57.8   79.1       0 18.1G 17.9G   0|0  0|30  791m    34k   34       09:29:40

Comment by Martin Bligh [ 30/Nov/15 ]

                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017  1484436    *0     *0     *0       0   148|0    71.7   94.6       0 18.3G 18.1G   0|2  0|30  750m    33k   34       09:30:42
localhost:27017  1332080    *0     *0     *0       0   133|0    71.7   94.7       0 18.3G 18.1G   0|0  0|30  673m    31k   34       09:30:43
localhost:27017  1315220    *0     *0     *0       0   137|0    72.2   95.3       0 18.3G 18.1G   0|1  0|28  664m    31k   34       09:30:44
localhost:27017  1433156    *0     *0     *0       0   147|0    72.2   95.6       0 18.3G 18.1G   0|0  0|30  724m    32k   34       09:30:45
localhost:27017  1361200    *0     *0     *0       0   131|0    72.1   95.3       0 18.3G 18.1G   0|3  0|29  688m    31k   34       09:30:46
localhost:27017  1265816    *0     *0     *0       0   127|0    72.2   94.5       0 18.3G 18.1G   0|0  0|30  640m    31k   34       09:30:47
localhost:27017  1443544    *0     *0     *0       0   153|0    73.4   95.1       0 18.3G 18.1G   0|0  0|32  730m    33k   34       09:30:48
localhost:27017  1356840    *0     *0     *0       0   135|0    72.8   95.8       0 18.3G 18.1G   0|5  0|23  685m    31k   34       09:30:49
localhost:27017  1207564    *0     *0     *0       0   119|0    71.9   94.4       0 18.3G 18.1G   0|1  0|29  610m    30k   34       09:30:50
localhost:27017  1148444    *0     *0     *0       0   120|0    70.9   93.2       0 18.3G 18.1G   0|0  0|29  580m    30k   34       09:30:51
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017  1506208    *0     *0     *0       0   147|0    71.1   95.2       0 18.3G 18.1G   0|0  0|30  761m    32k   34       09:30:52
localhost:27017  1408084    *0     *0     *0       0   146|0    72.6   96.0       0 18.3G 18.1G   0|0  0|31  711m    32k   34       09:30:53
localhost:27017  1409372    *0     *0     *0       0   139|0    72.9   95.4       0 18.3G 18.1G   0|0  0|30  712m    32k   34       09:30:54
localhost:27017  1408872    *0     *0     *0       0   142|0    72.9   95.6       0 18.3G 18.1G   0|0  0|31  712m    32k   34       09:30:55
localhost:27017  1271672    *0     *0     *0       0   135|0    72.6   95.4       0 18.3G 18.1G   0|4  0|28  643m    31k   34       09:30:56
localhost:27017  1382820    *0     *0     *0       0   139|0    73.5   95.4       0 18.3G 18.1G   0|0  0|29  699m    32k   34       09:30:57
localhost:27017  1212352    *0     *0     *0       0   118|0    71.8   95.3       0 18.3G 18.1G   0|0  0|32  613m    30k   34       09:30:58
localhost:27017  1426420    *0     *0     *0       0   143|0    71.5   95.8       0 18.3G 18.1G   0|0  0|31  721m    32k   34       09:30:59
localhost:27017  1380580    *0     *0     *0       0   138|0    73.1   95.3       0 18.3G 18.1G   0|0  0|32  698m    32k   34       09:31:00
localhost:27017   129508    *0     *0     *0       0    10|0    68.4   90.7       0 18.3G 18.1G   0|0  0|32   66m    19k   34       09:31:01
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017       *0    *0     *0     *0       0     1|0    61.6   83.7       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:02
localhost:27017       *0    *0     *0     *0       0     2|0    55.1   78.6       0 18.3G 18.1G   0|0  0|32  133b    19k   34       09:31:03
localhost:27017       64    *0     *0     *0       0     1|0    51.5   78.3       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:04
localhost:27017   438336    *0     *0     *0       0    53|0    51.4   79.7       0 18.3G 18.1G   0|0  0|32  221m    24k   34       09:31:05
localhost:27017       *0    *0     *0     *0       0     1|0    48.4   79.7       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:06
localhost:27017    18400    *0     *0     *0       0     1|0    45.2   79.8       0 18.3G 18.1G   0|0  0|32   10m    19k   34       09:31:07
localhost:27017   489368    *0     *0     *0       0    50|0    44.1   76.8       0 18.3G 18.1G   0|0  0|32  247m    23k   34       09:31:08
localhost:27017       *0    *0     *0     *0       0     1|0    38.1   69.4       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:09
localhost:27017     8096    *0     *0     *0       0     2|0    35.0   69.4       0 18.3G 18.1G   0|0  0|32    4m    19k   34       09:31:10
localhost:27017   511464    *0     *0     *0       0    48|0    34.5   71.0       0 18.3G 18.1G   0|0  0|32  258m    23k   34       09:31:11
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017   124864    *0     *0     *0       0    23|0    32.5   71.4       0 18.3G 18.1G   0|0  0|32   63m    21k   34       09:31:12
localhost:27017      736    *0     *0     *0       0     2|0    29.5   71.4       0 18.3G 18.1G   0|0  0|32  404k    19k   34       09:31:13
localhost:27017   427072    *0     *0     *0       0    40|0    27.9   72.0       0 18.3G 18.1G   0|0  0|32  216m    22k   34       09:31:14
localhost:27017    16032    *0     *0     *0       0     3|0    25.7   72.0       0 18.3G 18.1G   0|0  0|32    8m    19k   34       09:31:15
localhost:27017       64    *0     *0     *0       0     1|0    22.6   72.0       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:16
localhost:27017   245052    *0     *0     *0       0    24|0    20.7   72.5       0 18.3G 18.1G   0|1  0|29  124m    21k   34       09:31:17
localhost:27017   389516    *0     *0     *0       0    40|0    19.3   73.1       0 18.3G 18.1G   0|0  0|32  197m    22k   34       09:31:18
localhost:27017       36    *0     *0     *0       0     1|0    16.1   73.2       0 18.3G 18.1G   0|0  0|32   51k    19k   34       09:31:19
localhost:27017   296176    *0     *0     *0       0    30|0    13.7   73.3       0 18.3G 18.1G   0|3  0|27  149m    21k   34       09:31:20
localhost:27017    84224    *0     *0     *0       0    17|0    11.9   73.5       0 18.3G 18.1G   0|0  0|32   43m    20k   34       09:31:21
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017       *0    *0     *0     *0       0     1|0     9.2   73.5       0 18.3G 18.1G   0|0  0|32   79b    19k   34       09:31:22
localhost:27017   169140    *0     *0     *0       0    17|0     7.0   72.0       0 18.3G 18.1G   0|1  0|30   85m    20k   34       09:31:23
localhost:27017  1520928    *0     *0     *0       0   150|0    11.1   76.1       0 18.3G 18.1G   0|0  0|32  769m    33k   34       09:31:24
localhost:27017  1473232    *0     *0     *0       0   148|0    16.0   80.6       0 18.3G 18.1G   0|0  0|32  745m    33k   34       09:31:25
localhost:27017  1305796    *0     *0     *0       0   136|0    20.5   83.8       0 18.3G 18.1G   0|0  0|32  659m    31k   34       09:31:26
localhost:27017  1341060    *0     *0     *0       0   128|0    24.5   86.9       0 18.3G 18.1G   0|1  0|31  678m    31k   34       09:31:27
localhost:27017  1589312    *0     *0     *0       0   161|0    29.8   89.6       0 18.3G 18.1G   0|0  0|32  803m    34k   34       09:31:28
localhost:27017  1575536    *0     *0     *0       0   163|0    35.2   88.3       0 18.3G 18.1G   0|0  0|29  796m    34k   34       09:31:29
localhost:27017  1526688    *0     *0     *0       0   150|0    40.3   86.7       0 18.3G 18.1G   0|0  0|32  772m    33k   34       09:31:30
localhost:27017  1491204    *0     *0     *0       0   153|0    45.5   86.1       0 18.3G 18.1G   0|0  0|28  753m    33k   34       09:31:31
                  insert query update delete getmore command % dirty % used flushes vsize   res qr|qw ar|aw netIn netOut conn time
localhost:27017  1523180    *0     *0     *0       0   155|0    50.5   83.1       0 18.3G 18.1G   0|2  0|26  770m    33k   34       09:31:32
localhost:27017  1455484    *0     *0     *0       0   148|0    55.1   83.7       0 18.3G 18.1G   0|0  0|31  735m    32k   34       09:31:33
localhost:27017  1577108    *0     *0     *0       0   157|0    60.4   83.6       0 18.3G 18.1G   0|1  0|30  797m    33k   34       09:31:34
localhost:27017  1599044    *0     *0     *0       0   166|0    66.0   82.5       0 18.3G 18.1G   0|4  0|25  808m    34k   34       09:31:35
localhost:27017  1626592    *0     *0     *0       0   160|0    71.6   83.4       0 18.3G 18.1G   0|0  0|28  822m    34k   34       09:31:36
localhost:27017  1557572    *0     *0     *0       0   152|0    77.0   85.1       0 18.3G 18.1G   0|1  0|28  787m    33k   34       09:31:37
localhost:27017  1516272    *0     *0     *0       0   156|0    81.7   83.9       0 18.3G 18.1G   0|3  0|28  766m    33k   34       09:31:38
localhost:27017  1475556    *0     *0     *0       0   145|0    83.2   85.4       0 18.3G 18.1G   0|0  0|30  745m    32k   34       09:31:39
localhost:27017  1373336    *0     *0     *0       0   142|0    83.7   85.9       0 18.3G 18.1G   0|1  0|30  694m    32k   34       09:31:40
localhost:27017  1429136    *0     *0     *0       0   149|0    84.3   86.4       0 18.3G 18.1G   0|2  0|29  722m    33k   34       09:31:41

Generated at Thu Feb 08 03:58:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.