Priority: Blocker - P1
Resolution: Gone away
Affects Version/s: 3.1.9
Fix Version/s: None
After further testing, here is a summary of what we know:
- this only happens with a replSet that has secondaries; could not reproduce with a single-node replSet
- this is not related to the replica set protocolVersion
- usually happens after around 15 minutes of continuous inserts (c3.8xlarge); easier to reproduce on a smaller instance (m3.2xlarge); both setups use two SSDs to separate the DB files from the journal
- can reproduce with a 3-node replSet running on the same instance
- call tree and stack trace are attached
- the issue has been present for a while; it failed as early as the 3.1.8 release
- we saw two failures during the long insert phase of longevity tests; the symptom is insert throughput dropping very low or close to 0
- the traffic is generated with YCSB; will try to create a benchRun repro
- the original issue was observed on a sharded cluster with replSets as shards; after further testing, found this is an issue with a replSet with secondaries (I am testing with a 3-node replSet). Did not see this issue with a single-node replSet.
- the first SHA this was observed on is 3223f84a8eeaf89a30d6789038e5d68c7b019108; the longevity test runs once per week, so we do not have a narrow range or an exact SHA yet
- this can be reproduced easily with a 3-node replSet; 100% of attempts so far for me
- on the primary, when the drop happens, CPU usage is low; at the beginning the DB disk partition shows high disk I/O, then everything goes largely idle once throughput drops to near 0. This could be a lockup situation.
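For reference, a minimal sketch of standing up the 3-node replica set on a single instance as described above; the port numbers, dbpaths, and set name are my own placeholder choices, not taken from the original test setup:

```shell
# Sketch only: 3-node replSet on one host (ports/paths/set name are placeholders).
mkdir -p /data/rs0 /data/rs1 /data/rs2
mongod --replSet rs0 --port 27017 --dbpath /data/rs0 --fork --logpath /data/rs0/mongod.log
mongod --replSet rs0 --port 27018 --dbpath /data/rs1 --fork --logpath /data/rs1/mongod.log
mongod --replSet rs0 --port 27019 --dbpath /data/rs2 --fork --logpath /data/rs2/mongod.log
# Initiate the set from the first node; the other two become secondaries.
mongo --port 27017 --eval 'rs.initiate({_id: "rs0", members: [
  {_id: 0, host: "localhost:27017"},
  {_id: 1, host: "localhost:27018"},
  {_id: 2, host: "localhost:27019"}]})'
```

This is an environment-dependent command fragment (it requires local mongod binaries and writable data paths), not a definitive reproduction script.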
Steps to reproduce:
- set up a replSet with secondaries
- start continuous insert traffic with YCSB
- monitor throughput with mongostat; it may drop to very low throughput after about 15 minutes of the test run
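The monitoring step can be partially automated. The sketch below is a hypothetical helper (the function name, the 1%-of-peak threshold, and the 1000 ops/sec floor are my own choices) that reads insert-rate samples, e.g. the inserts/sec column cut from `mongostat` output, and flags once the rate collapses relative to the peak seen so far:

```shell
# detect_drop: hypothetical helper. Reads one numeric insert-rate sample per
# line on stdin and prints a warning whenever the current rate falls below
# 1% of the highest rate seen so far (only after a peak above 1000 ops/sec,
# so startup noise is ignored). Threshold values are arbitrary placeholders.
detect_drop() {
  awk '
    $1 ~ /^[0-9]+$/ {
      if ($1 + 0 > peak) peak = $1 + 0
      if (peak > 1000 && $1 + 0 < peak / 100)
        print "throughput drop: " $1 " (peak " peak ")"
    }'
}
```

While the YCSB insert load runs, something like `mongostat --host localhost:27017 | awk '{print $1}' | detect_drop` (field position depends on the mongostat version) would surface the drop without watching the terminal.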