- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 3.2.10, 3.2.12
- Component/s: Performance
After two days of uptime, some of my shards start responding to queries more and more slowly. Even if I stop all the workers loading data into the database and wait for all operations to finish, the mongod instances keep responding very slowly once the workers are restarted. Queries that used to take 0.1 seconds now take 40-50 seconds or more. The operations that seem to trigger this behaviour are bulk updates to a collection ($pull a "job") and inserts into another collection in a different database. On those machines the WiredTiger cache fills up, and the mongod instance eats up all available RAM and CPU. Running
sync && echo 3 > /proc/sys/vm/drop_caches
doesn't help.
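For reference, a rough sketch of the kind of write pattern involved, assuming pymongo; the database, collection, and field names are illustrative placeholders, not the real ones:

    # Hypothetical sketch of the workload described above.
    # Database/collection/field names and the driver are assumptions.
    from pymongo import MongoClient, UpdateOne

    client = MongoClient("mongodb://localhost:27017")  # assumed mongos/shard address

    jobs = client["workdb"]["jobs"]          # collection receiving bulk $pull updates
    results = client["otherdb"]["results"]   # collection in a different database receiving inserts

    def worker_iteration(finished_job_ids, new_docs):
        # Bulk update: $pull finished jobs out of the queue documents.
        ops = [
            UpdateOne({"queue": job_id}, {"$pull": {"queue": job_id}})
            for job_id in finished_job_ids
        ]
        if ops:
            jobs.bulk_write(ops, ordered=False)

        # Insert results into a collection in another database.
        if new_docs:
            results.insert_many(new_docs, ordered=False)

Many workers run this loop concurrently against the same collections.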
I am pretty sure this is a bug, because after I restart all the mongod instances I have no problems whatsoever for 2-3 days. I have very fast storage, so I don't mind reloading the hot data. How can I investigate this problem? What metrics should I monitor? I've tried writing a dummy stress-test script to run against a 3.2 instance to see if I can trigger the bug and compare with a 3.4 instance, but I haven't succeeded yet.
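A minimal sketch of how the relevant counters could be sampled during such a test, assuming pymongo and the standard serverStatus counter names for the WiredTiger cache and the global lock queue:

    # Periodically sample serverStatus counters related to the symptoms above:
    # WiredTiger cache fill and queued readers/writers.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed shard address

    def sample(interval_seconds=5):
        while True:
            status = client.admin.command("serverStatus")
            cache = status["wiredTiger"]["cache"]
            queues = status["globalLock"]["currentQueue"]
            print(
                "cache bytes:", cache["bytes currently in the cache"],
                "/", cache["maximum bytes configured"],
                "dirty:", cache["tracked dirty bytes in the cache"],
                "queued readers/writers:", queues["readers"], "/", queues["writers"],
            )
            time.sleep(interval_seconds)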
Running too many bulk inserts and updates against the same collection seems to be the culprit. The operations start waiting on one another and the yields pile up. That is when RAM and CPU usage spike and performance degrades until the restart.
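A minimal sketch of how this pile-up can be spotted while it is happening, assuming pymongo and the currentOp command; the namespace is a placeholder:

    # Look for long-running writes on the same namespace that are waiting
    # and accumulating yields, as described above.
    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")  # assumed shard address

    def show_waiting_writes(namespace="workdb.jobs"):  # illustrative namespace
        inprog = client.admin.command("currentOp")["inprog"]
        for op in inprog:
            if op.get("ns") == namespace and op.get("op") in ("update", "insert"):
                print(
                    op.get("opid"),
                    "secs_running:", op.get("secs_running"),
                    "numYields:", op.get("numYields"),
                    "waitingForLock:", op.get("waitingForLock"),
                )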