[SERVER-20874] replSet write stall under stress with wiredTiger Created: 09/Oct/15  Updated: 12/Nov/15  Resolved: 11/Nov/15

Status: Closed
Project: Core Server
Component/s: Replication, WiredTiger
Affects Version/s: 3.1.9
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Rui Zhang (Inactive) Assignee: Michael Cahill (Inactive)
Resolution: Done Votes: 0
Labels: sys-perf-reg
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File 3.2-rc0-run.tar.gz     HTML File call-tree-replSet-stall.html     HTML File calltree_188a60.html     File diagnostic.data.backup.tar.gz     PNG File diagnostic.data.png     HTML File gdbmon-reverse.html     PNG File gdbmon.png     PNG File pre-A.png     Text File stacktrace.log     Text File stacktrace_188a60.log    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

Setup replSet with secondaries
Start continuous insert traffic
Monitor throughput with mongostat; it may drop to very low throughput after ~15 minutes of the test run

Sprint: Performance A (10/08/15), Performance B (11/02/15)
Participants:

 Description   

Summary:

After further testing, here is a summary of what we know:

  • this only happens with a replSet with secondaries; we could not reproduce it with a single-node replSet
  • this is not related to the replica set protocolVersion
  • usually happens after around 15 min of continuous inserts (c3.8xlarge); it is easier to reproduce with a smaller instance (m3.2xlarge); both setups use two SSDs to separate the DB files from the journal
  • can be reproduced with a 3-node replSet running on the same instance
  • call tree and stack trace attached
  • the issue has been present for a while; it failed as early as the 3.1.8 release

Observation:

  • we saw two failures during the long insert phase of the longevity tests; the symptom is insert throughput dropping to very low or close to 0
  • the traffic is generated with YCSB; I will try to create a benchRun repro
  • the original issue was observed with a sharded cluster using a replSet as each shard; after further testing, we found it is an issue with a replSet with secondaries (I am testing with a 3-node replSet). We did not see this issue with a single-node replSet.
  • the first SHA on which this was observed is 3223f84a8eeaf89a30d6789038e5d68c7b019108; the longevity test runs once per week, so we do not have a small range or an exact SHA yet
  • this can be reproduced easily with a 3-node replSet; so far, 100% of attempts for me
  • on the primary, when the drop happens, CPU usage is low; at the beginning the DB disk partition shows high disk I/O, then everything goes largely idle after a while as throughput drops close to 0. This could be a lock-up situation.

Reproduce Steps:

  • Setup replSet with secondaries
  • Start continuous insert traffic with YCSB
  • Monitor throughput with mongostat; it may drop to very low throughput after ~15 minutes of the test run (see the sketch below)
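
A minimal end-to-end sketch of these steps on a single machine, assuming a 3-node replSet; the ports, paths, and YCSB workload parameters below are illustrative, since the exact workload configuration was not recorded in this ticket:

# Start three mongod instances and join them into a replica set.
mongod --replSet rs0 --port 27017 --dbpath /data/rs0-0 --storageEngine wiredTiger --fork --logpath /data/rs0-0.log
mongod --replSet rs0 --port 27018 --dbpath /data/rs0-1 --storageEngine wiredTiger --fork --logpath /data/rs0-1.log
mongod --replSet rs0 --port 27019 --dbpath /data/rs0-2 --storageEngine wiredTiger --fork --logpath /data/rs0-2.log
mongo --port 27017 --eval 'rs.initiate({_id: "rs0", members: [
    {_id: 0, host: "localhost:27017"},
    {_id: 1, host: "localhost:27018"},
    {_id: 2, host: "localhost:27019"}]})'

# Drive continuous insert traffic with the YCSB load phase.
./bin/ycsb load mongodb -P workloads/workloada \
    -p mongodb.url="mongodb://localhost:27017/ycsb?replicaSet=rs0" \
    -p recordcount=100000000 -threads 32

# In another terminal, watch for the throughput collapse (typically after ~15 min).
mongostat --port 27017 1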


 Comments   
Comment by Rui Zhang (Inactive) [ 11/Nov/15 ]

The recent sys-perf test passed, and the latest longevity test also completed its loading phase successfully. I also did a manual test; the latest master is fine.

Closing this as gone away.

longevity ongoing run: https://evergreen.mongodb.com/task/mongo_longevity_linux_wt_shard_shard_cluster_test_fd539883bfc534a7f16aa352d503c903e9d367df_15_11_11_17_47_43

Comment by Alexander Gorrod [ 26/Oct/15 ]

I've reproduced this performance issue using the tip of master (3d74ad6da4dfeb293cd77ae71583740f008e7007). Since there wasn't information about the replica set configuration in the ticket, I created one:

net:
    port: 27017
replication:
    replSetName: "myRS1"
storage:
    dbPath: "/data/master"
    directoryPerDB: true
    engine: wiredTiger
    syncPeriodSecs: 60
    journal:
        enabled: true
 
    wiredTiger:
        engineConfig:
            cacheSizeGB: 4
            journalCompressor: "none"
            directoryForIndexes: false
        collectionConfig:
           blockCompressor: "snappy"
        indexConfig:
            prefixCompression: true
 
systemLog:
   destination: file
   path: "/data/mongodb_master.log"
   quiet: false
   logAppend: true
replication:
    oplogSizeMB: 10000

With matching configurations for two secondaries.

$ diff s20874_master.conf s20874_sec1.conf
2c2
<     port: 27017
---
>     port: 27018
6c6
<     dbPath: "/data/master"
---
>     dbPath: "/data/sec1"
25c25
<    path: "/data/mongodb_master.log"
---
>    path: "/data/mongodb_sec1.log"
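
For completeness, here is a sketch of how the secondary configs and the replica set itself could be assembled from the master config above (the sed substitutions, the second secondary, and the rs.initiate() document are assumptions; only the diff above is from the actual setup, and the hostname is taken from the YCSB output below):

# Derive the secondary configs from the master config.
sed -e 's/27017/27018/' -e 's#/data/master#/data/sec1#' -e 's/mongodb_master/mongodb_sec1/' \
    s20874_master.conf > s20874_sec1.conf
sed -e 's/27017/27019/' -e 's#/data/master#/data/sec2#' -e 's/mongodb_master/mongodb_sec2/' \
    s20874_master.conf > s20874_sec2.conf

# Start the three nodes and initiate the replica set.
mongod --config s20874_master.conf --fork
mongod --config s20874_sec1.conf --fork
mongod --config s20874_sec2.conf --fork
mongo --host tigger --port 27017 --eval 'rs.initiate({_id: "myRS1", members: [
    {_id: 0, host: "tigger:27017"},
    {_id: 1, host: "tigger:27018"},
    {_id: 2, host: "tigger:27019"}]})'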

I start all of the nodes on a single machine, and connect them into a replica set. Then I start running the YCSB load against my new replica set as per the instructions above. The results are poor:

mongo connection created with tigger:27017
 10 sec: 193787 operations; 18995 current ops/sec; [INSERT AverageLatency(us)=1550.13]
 20 sec: 430714 operations; 23687.96 current ops/sec; [INSERT AverageLatency(us)=1342.69]
 30 sec: 654334 operations; 22359.76 current ops/sec; [INSERT AverageLatency(us)=1431.23]
 40 sec: 894127 operations; 23976.9 current ops/sec; [INSERT AverageLatency(us)=1322.38]
 50 sec: 1072327 operations; 17818.22 current ops/sec; [INSERT AverageLatency(us)=1792.12]
 60 sec: 1266782 operations; 19443.56 current ops/sec; [INSERT AverageLatency(us)=1638.49]
 70 sec: 1485742 operations; 21893.81 current ops/sec; [INSERT AverageLatency(us)=1454.33]
 80 sec: 1531401 operations; 4565.44 current ops/sec; [INSERT AverageLatency(us)=6886.53]
 90 sec: 1535545 operations; 414.36 current ops/sec; [INSERT AverageLatency(us)=78248.39]
 100 sec: 1541802 operations; 625.7 current ops/sec; [INSERT AverageLatency(us)=51170.5]
 110 sec: 1546598 operations; 479.55 current ops/sec; [INSERT AverageLatency(us)=66590.84]
 120 sec: 1551822 operations; 522.35 current ops/sec; [INSERT AverageLatency(us)=61249.55]
 130 sec: 1555718 operations; 389.56 current ops/sec; [INSERT AverageLatency(us)=81855.73]
 140 sec: 1558652 operations; 293.37 current ops/sec; [INSERT AverageLatency(us)=109444.32]
 150 sec: 1561087 operations; 243.5 current ops/sec; [INSERT AverageLatency(us)=131242.84]
 160 sec: 1563338 operations; 225.08 current ops/sec; [INSERT AverageLatency(us)=142140.46]
 170 sec: 1565560 operations; 222.18 current ops/sec; [INSERT AverageLatency(us)=142351.93]
 180 sec: 1567746 operations; 218.58 current ops/sec; [INSERT AverageLatency(us)=148322.6]
 190 sec: 1569956 operations; 220.98 current ops/sec; [INSERT AverageLatency(us)=144015.93]
 200 sec: 1572073 operations; 211.68 current ops/sec; [INSERT AverageLatency(us)=150722]
 210 sec: 1574117 operations; 204.4 current ops/sec; [INSERT AverageLatency(us)=156846.41]

I suspect this is exacerbated by my choosing a relatively small cache size (4GB), which means that WiredTiger needs to do more work to maintain old snapshots.

Looking at the state of the system, I see that old named snapshots are being removed, but they are a long way behind the current transaction in the system. For example, in a debugger I see:

(gdb) p *txn_global
$14 = {current = 13465514, last_running = 13465482, oldest_id = 537629, scan_count = 0, checkpoint_id = 7, checkpoint_gen = 74,
  checkpoint_pinned = 13465482, nsnap_rwlock = 0x3733900, nsnap_oldest_id = 537629, nsnaph = {tqh_first = 0xa18eb00,
    tqh_last = 0xa18ea48}, states = 0x4a62000}

That means that the current transaction ID is ~13.5 million while the oldest live named snapshot is at ~0.5 million, so WiredTiger needs to keep the updates from ~13 million transactions available to satisfy the oldest named snapshot.
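
Without attaching a debugger, a rough way to watch the same gap from the server side is via serverStatus (a sketch; it assumes these WiredTiger transaction statistics are exposed in this build, and the exact stat names may vary by version):

mongo --port 27017 --quiet --eval '
    var t = db.serverStatus().wiredTiger.transaction;
    print("range of IDs pinned:             " + t["transaction range of IDs currently pinned"]);
    print("range pinned by named snapshots: " + t["transaction range of IDs currently pinned by named snapshots"]);
    print("range pinned by a checkpoint:    " + t["transaction range of IDs currently pinned by a checkpoint"]);'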

Comment by Martin Bligh [ 23/Oct/15 ]

Dan and I were just discussing whether this might be related to SERVER-20174, but I think the stacktrace looks different?
Curious if it reproduces with --setParameter enableReplSnapshotThread=false ?
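
A sketch of re-running the repro with that parameter set on each member (the config file names follow the ones shown in the comment above; everything else unchanged):

mongod --config s20874_master.conf --fork --setParameter enableReplSnapshotThread=false
mongod --config s20874_sec1.conf --fork --setParameter enableReplSnapshotThread=false
mongod --config s20874_sec2.conf --fork --setParameter enableReplSnapshotThread=false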

Comment by Fangcheng Sun [ 16/Oct/15 ]

We have a similar issue on mongod 3.0.4 and 3.0.7.

Comment by Rui Zhang (Inactive) [ 13/Oct/15 ]

Added the call tree from gdbmon (call-tree-replSet-stall.html); this was captured during the phase with a very low insert rate of ~10-20 inserts per second.

  • SHA: 4877c6b197438495d9fdd88317d4b79d565014c6 (the latest master)
  • all other settings stay the same (v1 protocol)

Looking at the call graph, the waits seem to be mostly in eviction-related functions, so this may be a WT-related issue. I am re-running the v0 protocol just to be sure it passes.
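
For reference, the raw stacks behind a call tree like this can also be captured with plain gdb when gdbmon is not at hand (a sketch; the pgrep pattern assumes the stalled primary is the only mongod with 27017 on its command line):

gdb -p "$(pgrep -f 'mongod.*27017')" -batch \
    -ex 'set pagination off' -ex 'thread apply all bt' > stacktrace.log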

Comment by Daniel Pasette (Inactive) [ 12/Oct/15 ]

rui.zhang, can you clearly state the reproduction steps needed?

  • This is a three node replset using protocol v1?
  • What is the YCSB configuration, number of documents, number of workers, etc?
  • Can this be reproduced on a single machine or does it require three nodes? What are the machine details?
  • Please tar and attach diagnostic.data from $dbpath.
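
For the last point, a sketch of packaging diagnostic.data from each member's dbpath (the paths here are illustrative):

for p in /data/master /data/sec1 /data/sec2; do
    tar -czf "diagnostic.data.$(basename $p).tar.gz" -C "$p" diagnostic.data
done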