- Type: Bug
- Resolution: Done
- Priority: Blocker - P1
- None
- Affects Version/s: 3.1.9
- Component/s: Replication, WiredTiger
- ALL
- Performance A (10/08/15), Performance B (11/02/15)
Summary:
After further testing, here is a summary of what we know:
- this only happens on a replSet with secondaries; could not reproduce it with a single-node replSet
- this is not related to the replica set protocolVersion
- it usually happens after around 15 minutes of continuous inserts (c3.8xlarge); it is easier to reproduce on a smaller instance (m3.2xlarge); both setups use two SSDs to separate the DB files and the journal
- it can be reproduced with a 3-node replSet running on the same instance (see the setup sketch after this list)
- call tree and stack trace are attached
- the issue has been present for a while; it failed as early as the 3.1.8 release
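A minimal sketch of initiating a 3-node replSet on a single machine, assuming three mongod processes are already running locally with --replSet rs0 on ports 27017-27019; the set name, ports, and layout here are placeholders, not the exact longevity configuration:

# Hedged sketch: initiate a 3-node replSet on one machine.
# Assumes three mongod processes already started with --replSet rs0
# on ports 27017, 27018, 27019 (names and ports are placeholders).
from pymongo import MongoClient

# Connect directly to the first member before the set is initiated.
client = MongoClient("localhost", 27017)

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "localhost:27017"},
        {"_id": 1, "host": "localhost:27018"},
        {"_id": 2, "host": "localhost:27019"},
    ],
}
client.admin.command("replSetInitiate", config)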
Observation:
- we saw two failures during the long insert phase of the longevity tests; the symptom is that insert throughput drops to very low or close to 0
- the traffic is generated with YCSB; will try to create a benchRun repo (a stand-in insert workload is sketched after this list)
- the original issue was observed on a sharded cluster with a replSet as the shard; after further testing, found that this is an issue with a replSet with secondaries (I am testing with a 3-node replSet). Did not see this issue with a single-node replSet.
- the first SHA observed with this failure is 3223f84a8eeaf89a30d6789038e5d68c7b019108; the longevity test runs once per week, so we do not have a narrow range or an exact SHA yet
- this can be reproduced easily with a 3-node replSet; 100% of attempts so far for me
- on the primary, when the drop happens, CPU usage is low; at the beginning the DB disk partition shows high disk I/O, then everything goes largely idle once throughput has dropped to close to 0, which could indicate a lock-up
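A rough stand-in for the YCSB insert traffic (not the actual workload or the planned benchRun repro): continuous batched inserts of roughly 1 KB documents into one collection. The replSet name "rs0", the ycsb.usertable namespace, document shape, and batch size are all assumptions for illustration:

# Hedged stand-in for the YCSB insert phase, not the actual workload.
# Assumptions: replSet "rs0" reachable at localhost:27017; collection
# name, document shape, and batch size are illustrative only.
import os
import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017, replicaSet="rs0")
coll = client["ycsb"]["usertable"]

batch_size = 1000
inserted = 0
start = time.time()
while True:
    docs = [
        {"_id": "user%d" % (inserted + n), "field0": os.urandom(512).hex()}
        for n in range(batch_size)
    ]
    coll.insert_many(docs)
    inserted += batch_size
    if inserted % 100000 == 0:
        elapsed = time.time() - start
        print("inserted %d docs, %.0f docs/sec overall" % (inserted, inserted / elapsed))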
Reproduce Steps:
- Set up a replSet with secondaries
- Start continuous insert traffic with YCSB
- Monitor throughput with mongostat (or a script like the one sketched below); throughput may drop very low after about 15 minutes of the test run
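A minimal throughput monitor as an alternative way to watch for the drop; it assumes the replSet is named "rs0" with a member at localhost:27017, and simply samples the primary's serverStatus insert opcounter once per second (roughly what mongostat reports):

# Hedged sketch: poll the primary's insert opcounter once per second so the
# drop to near-zero throughput described above is easy to spot.
# Assumptions: replSet name "rs0", member reachable at localhost:27017.
import time
from pymongo import MongoClient

client = MongoClient("localhost", 27017, replicaSet="rs0")

prev = client.admin.command("serverStatus")["opcounters"]["insert"]
while True:
    time.sleep(1)
    cur = client.admin.command("serverStatus")["opcounters"]["insert"]
    print("inserts/sec on primary: %d" % (cur - prev))
    prev = cur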