[SERVER-16056] Periodic throughput pauses during write load on WT Created: 10/Nov/14  Updated: 11/Jul/16  Resolved: 21/Nov/14

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: 2.8.0-rc0
Fix Version/s: 2.8.0-rc1

Type: Bug Priority: Critical - P2
Reporter: Rui Zhang (Inactive) Assignee: Daniel Pasette (Inactive)
Resolution: Done Votes: 0
Labels: 28qa
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 11_20_thread_16.png     PNG File 11_20_thread_64.png     PNG File 11_20_thread_8.png     PNG File cp_concurrency_test.png     PNG File longrun_1030_16056.png     PNG File longrun_16056.png     PNG File ops_per_second_16056.png     PNG File overview_16056.png    
Issue Links:
Related
Operating System: ALL
Participants:

 Description   

Tested with this SHA:

[slave-7] ➜  wt  cat bin/2014-11-10/build.info
commit 7fa44159b371c106cd9742174ff13a22aab5ce21

There is a performance regression compared to the 11-07 and earlier builds.



 Comments   
Comment by Rui Zhang (Inactive) [ 20/Nov/14 ]

Tested the MCI build with 217151b66aefddca0a62e92aa095bb4f27dba574.

The attached charts compare mmapv1 vs. WT for the same 11-20 build.

Mixed traffic with 8 threads

  • mmapv1 and WT show similar throughput. The traffic is about 20% writes and 80% mixed queries (see the workload sketch after this list).

Mixed traffic with 16 threads

  • throughput is relatively smooth

Mixed traffic with 64 threads

  • throughput is mostly smooth, but there is a dip in WT throughput lasting about 100 seconds. Disk writes (kB_wr) and context switching (cswch and nvcswch) show a lot of activity during the dip (cswch drops while nvcswch spikes, possibly something to do with the host OS). Not exactly sure what is going on there; it may be a separate issue.

  • Overall, no pauses for WT.
  • Behavior is comparable to mmapv1.
  • There is a long dip in WT throughput with 64 threads that needs investigation.
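
For context, here is a minimal sketch of the traffic shape described above (roughly 20% writes, 80% queries, N client threads). The database, collection, document shape, and run length are assumptions for illustration; this is not the actual test harness.

# Illustrative sketch only: approximates the "20% writes / 80% mixed queries"
# traffic with N client threads against a local mongod. Database, collection,
# document shape, and run length are assumptions, not the real test harness.
import random
import threading
import time
from pymongo import MongoClient

THREADS = 8            # the runs above also used 16 and 64
DURATION_SECS = 600    # assumed run length

def worker(stop_at):
    coll = MongoClient("mongodb://localhost:27017")["test"]["docs"]
    while time.time() < stop_at:
        if random.random() < 0.2:                  # ~20% writes
            coll.insert_one({"x": random.randint(0, 1_000_000)})
        else:                                      # ~80% queries
            coll.find_one({"x": random.randint(0, 1_000_000)})

stop_at = time.time() + DURATION_SECS
threads = [threading.Thread(target=worker, args=(stop_at,)) for _ in range(THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()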

Explanation of stats (a rough collection sketch follows the list):

  • kB_wr: kilobytes written to disk per second by mongod
  • cswch: total context switches for mongod (per-thread readings summed together)
  • nvcswch: non-voluntary context switches (sum across all threads); for more details, refer to the pidstat documentation
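
For reference, a rough sketch of how these per-thread counters can be gathered from /proc (pidstat reads the same kernel sources). The mongod PID and the 1-second interval are assumptions; this is not the exact script behind the charts.

# Rough sketch of how the per-thread counters above can be gathered from /proc
# (pidstat reads the same kernel sources). The mongod PID and the 1-second
# interval are assumptions; this is not the exact script behind the charts.
import glob
import time

def read_ctxt_switches(pid):
    # Sum voluntary / non-voluntary context switches across all mongod threads.
    vol = nonvol = 0
    for status in glob.glob(f"/proc/{pid}/task/*/status"):
        with open(status) as f:
            for line in f:
                if line.startswith("voluntary_ctxt_switches:"):
                    vol += int(line.split()[1])
                elif line.startswith("nonvoluntary_ctxt_switches:"):
                    nonvol += int(line.split()[1])
    return vol, nonvol

def read_kb_written(pid):
    # Kilobytes written to the block layer by the whole mongod process.
    with open(f"/proc/{pid}/io") as f:
        for line in f:
            if line.startswith("write_bytes:"):
                return int(line.split()[1]) // 1024
    return 0

pid = 12345                      # assumed mongod PID
prev = (*read_ctxt_switches(pid), read_kb_written(pid))
while True:
    time.sleep(1)
    cur = (*read_ctxt_switches(pid), read_kb_written(pid))
    cswch, nvcswch, kb_wr = (c - p for c, p in zip(cur, prev))
    print(f"cswch/s={cswch} nvcswch/s={nvcswch} kB_wr/s={kb_wr}")
    prev = cur
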
Comment by Rui Zhang (Inactive) [ 19/Nov/14 ]

Running today's build; will update when it is done.

Comment by Rui Zhang (Inactive) [ 14/Nov/14 ]

Tested after applying the above patch to today's latest master.

  • There is definitely some improvement over the 11-11 build (before the patch); the dip in throughput is not as dramatic.
  • Throughput still goes up and down.
  • Periodic disk writes are still pretty high compared to the 10-30 build.

git log:

commit f3d13ab3bb5375ae09e5b1d8b896709d5b6d4950
Author: Michael Cahill <....>
Date:   Thu Nov 13 21:09:21 2014 +1100
 
    Add a lock around table operations, so that cursor opens don't see tables half created or half dropped.
 
    Signed-off-by: Rui Zhang <....>
 
commit 4298d91b5e6f36a4a79580e78abf212625c44b5b
Author: Michael Cahill <....>
Date:   Thu Nov 13 21:09:20 2014 +1100
 
    Split the schema lock into a lock that prevents concurrent schema-changing operations (create, drop, rename, etc.), and a separate lock that protects the shared handle list. The main goal in the short term is to ensure that checkpoints don't hold any locks that block cursor opens while doing I/O.
 
    Signed-off-by: Rui Zhang <....>
 
commit 284f942a45b877f0baecd19cbf17fc2a4e246a79
Author: Kaloian Manassiev <....>
Date:   Thu Nov 6 22:18:15 2014 -0500
 
    SERVER-14062 Cleanup Client::hasWrittenSinceCheckpoint and some usages of cc()
 
    This is in preparation for removing the Global OperationContext registry.
    OperationContexts will instead be reachable through the client.
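
For illustration only, a minimal Python sketch of the two-lock pattern the second commit above describes: schema-changing operations serialize on one lock, the shared handle list gets its own short-lived lock, and checkpoint I/O is done without holding the handle-list lock, so cursor opens are not blocked. The class and function names are assumptions, not WiredTiger code.

# Illustrative sketch of the lock split described in the second commit above;
# the class, names, and checkpoint/cursor logic are assumptions, not WiredTiger code.
import threading

class HandleTable:
    def __init__(self):
        self.schema_lock = threading.Lock()       # serializes create/drop/rename
        self.handle_list_lock = threading.Lock()  # protects only the shared handle list
        self.handles = {}                         # table name -> handle object

    def create_table(self, name):
        with self.schema_lock:                    # schema changes are serialized here
            handle = object()                     # stand-in for building the new table
            with self.handle_list_lock:           # brief critical section to publish it
                self.handles[name] = handle

    def open_cursor(self, name):
        with self.handle_list_lock:               # only needs the short handle-list lock
            return self.handles.get(name)

    def checkpoint(self):
        with self.schema_lock:                    # keep the schema stable for the checkpoint
            with self.handle_list_lock:           # copy the handle list, then release it
                snapshot = list(self.handles.values())
            for handle in snapshot:               # slow I/O without the handle-list lock,
                write_handle_to_disk(handle)      # so open_cursor() is never blocked on it

def write_handle_to_disk(handle):
    pass                                          # placeholder for checkpoint I/O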

Comment by Daniel Pasette (Inactive) [ 13/Nov/14 ]

This PR in wiredtiger should mostly solve this issue:
https://github.com/wiredtiger/wiredtiger/pull/1380
