Type: Bug
Resolution: Cannot Reproduce
Priority: Major - P3
Affects Version/s: 2.0.5
Component/s: Performance, Stability
Environment: Linux x64, boost 1.41
While trying some new stuff on our sharded and replicated production cluster, involving heavy bursts of writing (e.g. 6000 writes/s), we saw a severe performance degradation: big replication lag, lots of timeouts on reads... We tried to reproduce the issue in tests and got results similar to SERVER-3663. For example, when running 10 writers, we get more than 9000 writes/s; when we then add 100 readers, the writes collapse to somewhere between 1 and 5 writes/s (while the readers happily perform at 3000 reads/s). When we stop the readers, the writes get back to 9000 writes/s. When running 10 readers on a slave, they perform at 1000 reads/s while replication is idle, but drop to 50-100 reads/s when replication is taking place... When running a large number of writers to insert data, after a little while performance is horrendous and randomly bounces between 0 and spikes of 1000 writes/s.
We think that one possible cause is that the read-write locking is not fair. I've seen that the lock classes are made of layers of encapsulation around one of several read-write lock back ends: is there any special fairness logic, implemented in these MongoDB layers, that I missed? In our case, the back end used is boost's shared_mutex. We've been experimenting with some changes to it to improve its fairness, and they significantly change the MongoDB behavior described above.
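To make the fairness idea concrete, here is a hedged sketch of a writer-preferring read-write lock built from a mutex and two condition variables. `FairSharedMutex` and `demo` are hypothetical names of ours, not MongoDB's classes or our actual patch to boost; the point is only the policy: once a writer is waiting, newly arriving readers block, so a continuous stream of readers cannot starve writers indefinitely.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class FairSharedMutex {
    std::mutex m_;
    std::condition_variable readers_cv_, writers_cv_;
    int active_readers_ = 0;
    int waiting_writers_ = 0;
    bool writer_active_ = false;

public:
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        // New readers wait while a writer holds *or waits for* the lock:
        // this is the writer-preference that prevents writer starvation.
        readers_cv_.wait(lk, [&] {
            return !writer_active_ && waiting_writers_ == 0;
        });
        ++active_readers_;
    }
    void unlock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        if (--active_readers_ == 0) writers_cv_.notify_one();
    }
    void lock() {
        std::unique_lock<std::mutex> lk(m_);
        ++waiting_writers_;
        writers_cv_.wait(lk, [&] {
            return !writer_active_ && active_readers_ == 0;
        });
        --waiting_writers_;
        writer_active_ = true;
    }
    void unlock() {
        std::unique_lock<std::mutex> lk(m_);
        writer_active_ = false;
        writers_cv_.notify_one();   // hand off to a waiting writer first
        readers_cv_.notify_all();   // readers proceed only if no writer waits
    }
};

// Small deterministic usage check: 4 writers each do 1000 exclusive
// increments, so mutual exclusion must yield exactly 4000.
long demo() {
    FairSharedMutex mu;
    long counter = 0;
    std::vector<std::thread> writers;
    for (int i = 0; i < 4; ++i)
        writers.emplace_back([&] {
            for (int j = 0; j < 1000; ++j) {
                mu.lock();
                ++counter;
                mu.unlock();
            }
        });
    for (auto& t : writers) t.join();
    return counter;
}
```

A real implementation would likely want something closer to alternation or ticketing rather than strict writer preference (which can starve readers instead), but even this simple policy is enough to restore write throughput in the scenario described above.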
I've read other tickets related to this kind of issue (SERVER-3663, SERVER-3609, SERVER-3801...). We consider the use case that exposed this issue to be normal operation: we can't separate readers and writers, and we don't think the right approach here is to consolidate or throttle our writes, or more generally to try to avoid the situations where fairness is necessary.