Core Server / SERVER-20874

replSet write stall under stress with WiredTiger


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker - P1
    • Resolution: Gone away
    • Affects Version/s: 3.1.9
    • Fix Version/s: None
    • Component/s: Replication, WiredTiger
    • Labels:
    • Operating System:
      ALL
    • Steps To Reproduce:

      Set up a replSet with secondaries.
      Start continuous insert traffic.
      Monitor throughput with mongostat; it may drop to very low values after about 15 min of the test run.
    • Sprint:
      Performance A (10/08/15), Performance B (11/02/15)

      Description

      Summary:

      After further testing, here is a summary of what we know:

      • this only happens on a replSet with secondaries; it could not be reproduced with a single-node replSet
      • this is not related to the replica protocolVersion
      • usually happens after around 15 min of continuous inserts (c3.8xlarge); easier to reproduce on a smaller instance (m3.2xlarge); both setups use two SSDs to separate the DB files and the journal
      • can be reproduced with a 3-node replSet running on the same instance
      • call tree and stack trace are attached
      • the issue has been present for a while; it failed as early as the 3.1.8 release
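      The single-instance, 3-node setup mentioned above can be sketched roughly as follows; the ports, data paths, and the `rs0` set name are placeholders, not taken from the ticket:

```shell
# Start three mongod members of one replica set on a single host.
# Hypothetical ports and dbpaths; adjust to your environment.
for port in 27017 27018 27019; do
  mkdir -p /data/rs0-$port
  mongod --replSet rs0 --port $port --dbpath /data/rs0-$port \
         --storageEngine wiredTiger \
         --fork --logpath /data/rs0-$port/mongod.log
done

# Initiate the set from the first member.
mongo --port 27017 --eval 'rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "localhost:27017" },
    { _id: 1, host: "localhost:27018" },
    { _id: 2, host: "localhost:27019" }
  ]
})'
```

      Note this colocates all three members on one instance, matching the bullet above; the original failures were also seen with separate SSDs for DB files and journal, which this sketch does not replicate.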

      Observation:

      • we saw two failures during the long insert phase of longevity tests; the symptom is insert throughput dropping to very low or close to 0
      • the traffic is generated with YCSB; will try to create a benchRun repro
      • the original issue was observed on a sharded cluster with replSets as shards; after further testing, we found it is an issue with a replSet with secondaries (I am testing with a 3-node replSet). We did not see this issue with a single-node replSet.
      • the first SHA observed for this is 3223f84a8eeaf89a30d6789038e5d68c7b019108; the longevity test runs once per week, so we do not have a small range or an exact SHA yet
      • this can be reproduced easily with a 3-node replSet; 100% of attempts so far for me
      • on the primary, when the drop happens, CPU usage is low; at the beginning the DB disk partition shows high disk I/O, then everything goes largely idle once throughput has dropped to close to 0. This could be a lock-up situation.

      Reproduce Steps:

      • Set up a replSet with secondaries
      • Start continuous insert traffic with YCSB
      • Monitor throughput with mongostat; it may drop to very low values after about 15 min of the test run
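      A minimal stand-in for the insert and monitoring steps, assuming a replSet member on localhost:27017. The benchRun workload below is a hypothetical sketch of the planned benchRun repro, not the actual YCSB configuration used in the failing runs:

```shell
# Continuous insert traffic via the mongo shell's benchRun helper.
# Namespace, thread count, and duration are assumptions for illustration.
mongo --port 27017 --eval '
  var res = benchRun({
    ops: [{
      ns: "test.usertable",
      op: "insert",
      doc: { field0: { "#RAND_STRING": [100] } }  // random 100-char payload
    }],
    parallel: 16,          // assumed client thread count
    seconds: 3600,
    host: "localhost:27017"
  });
  printjson(res);
'

# In a second terminal, watch insert throughput; the stall shows up as
# the insert column dropping to near 0 after roughly 15 min.
mongostat --port 27017
```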

        Attachments

        1. 3.2-rc0-run.tar.gz
          3.62 MB
        2. calltree_188a60.html
          7.95 MB
        3. call-tree-replSet-stall.html
          2.13 MB
        4. diagnostic.data.backup.tar.gz
          383 kB
        5. diagnostic.data.png
          357 kB
        6. gdbmon.png
          211 kB
        7. gdbmon-reverse.html
          2.78 MB
        8. pre-A.png
          76 kB
        9. stacktrace_188a60.log
          222 kB
        10. stacktrace.log
          195 kB

          Activity

            People

            Assignee:
            michael.cahill Michael Cahill
            Reporter:
            rui.zhang Rui Zhang
            Participants:
            Votes:
            0
            Watchers:
            16

              Dates

              Created:
              Updated:
              Resolved: