[SERVER-6004] Intensive reading/writing causes reader/writer starvation Created: 05/Jun/12  Updated: 10/Dec/14  Resolved: 24/Jan/14

Status: Closed
Project: Core Server
Component/s: Performance, Stability
Affects Version/s: 2.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pierre Ynard Assignee: Ben Becker
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Linux x64, boost 1.41

Attachments: Text File Program.cs    
Operating System: ALL
Participants:

 Description   

While trying some new things on our sharded and replicated production cluster, involving heavy bursts of writes (e.g. 6000 writes/s), we saw severe performance degradation: big replication lag, lots of timeouts on reads, and so on. We tried to reproduce the issue in tests and got results similar to SERVER-3663. For example:

- With 10 writers running we get more than 9000 writes/s; when we add 100 readers, writes collapse to somewhere between 1 and 5 writes/s (while the readers happily perform 3000 reads/s). When we stop the readers, writes go back to 9000 writes/s.
- With 10 readers running on a slave, they perform 1000 reads/s while replication is idle, but drop to 50-100 reads/s while replication is taking place.
- With a large number of writers inserting data, after a little while performance becomes horrendous, randomly bouncing between 0 and spikes of 1000 writes/s.
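For illustration only (this is not the attached test program, and not how the numbers above were measured), a minimal standalone harness with the same 100-reader / 10-writer mix on a plain reader-writer lock looks roughly like this; our build used boost 1.41's shared_mutex, while the sketch uses C++17's std::shared_mutex, and whether writers actually starve depends on the lock implementation:

{code}
// Illustrative sketch only: simulated reads/writes contending on one
// process-wide reader-writer lock. Not the attached Program.cs.
#include <atomic>
#include <chrono>
#include <iostream>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

int main() {
    std::shared_mutex lock;                    // stand-in for the server's RW lock
    std::atomic<long long> reads{0}, writes{0};
    std::atomic<bool> stop{false};

    auto reader = [&] {                        // simulated query: shared lock
        while (!stop) {
            std::shared_lock<std::shared_mutex> rl(lock);
            ++reads;
        }
    };
    auto writer = [&] {                        // simulated insert: exclusive lock
        while (!stop) {
            std::unique_lock<std::shared_mutex> wl(lock);
            ++writes;
        }
    };

    std::vector<std::thread> threads;
    for (int i = 0; i < 100; ++i) threads.emplace_back(reader);
    for (int i = 0; i < 10; ++i)  threads.emplace_back(writer);

    std::this_thread::sleep_for(std::chrono::seconds(5));
    stop = true;
    for (auto& t : threads) t.join();

    std::cout << "reads/s:  " << reads / 5 << "\n"
              << "writes/s: " << writes / 5 << "\n";
}
{code}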

We think one possible cause is that the read-write locking is not fair. I've seen that the lock classes are layers of encapsulation around one of several read-write lock back ends: is there any special fairness logic, implemented in these MongoDB layers, that I missed? In our case the back end used is boost's shared_mutex. We've been experimenting with some changes to it to improve its fairness, and they make a significant difference to the MongoDB behavior described above.
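To illustrate what is meant by fairness logic here (a minimal sketch, not MongoDB's or boost's actual code): a writer-preferring reader-writer lock can be built from a mutex and a condition variable, blocking new readers as soon as a writer is queued so that a continuous stream of readers cannot starve writers:

{code}
// Sketch of a writer-preferring RW lock; illustration only.
#include <condition_variable>
#include <mutex>

class FairSharedMutex {
    std::mutex m_;
    std::condition_variable cv_;
    int active_readers_ = 0;
    int waiting_writers_ = 0;
    bool writer_active_ = false;

public:
    void lock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        // New readers wait while a writer holds the lock *or* is queued.
        cv_.wait(lk, [&] { return !writer_active_ && waiting_writers_ == 0; });
        ++active_readers_;
    }
    void unlock_shared() {
        std::unique_lock<std::mutex> lk(m_);
        if (--active_readers_ == 0) cv_.notify_all();
    }
    void lock() {
        std::unique_lock<std::mutex> lk(m_);
        ++waiting_writers_;
        cv_.wait(lk, [&] { return !writer_active_ && active_readers_ == 0; });
        --waiting_writers_;
        writer_active_ = true;
    }
    void unlock() {
        std::unique_lock<std::mutex> lk(m_);
        writer_active_ = false;
        cv_.notify_all();
    }
};
{code}

The obvious trade-off is that strict writer preference can starve readers instead when writers queue continuously, so a production lock needs some balancing between the two sides.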

I've read other tickets related to this kind of issue (SERVER-3663, SERVER-3609, SERVER-3801...). We consider the use case that exposed this issue to be normal operation: we can't separate readers from writers, and we don't think the right approach is to consolidate or throttle our writes, or more generally to try to avoid the situations where fairness is necessary.



 Comments   
Comment by Ben Becker [ 24/Jan/14 ]

Thanks for the feedback. I ran a similar workload in JavaScript using 16 concurrent shells, but I'm not able to reproduce the issue against v2.0.9.

Comment by Pierre Ynard [ 20/Jan/14 ]

Hello,

I don't really remember, but I suppose so; why else would I have attached a test sample? Anyway, I believe the problem is no longer relevant since we migrated to 2.2 with its new locking logic.

Comment by Ben Becker [ 14/Jan/14 ]

Hi Pierre,

It seems that the attached script generates 100,000 documents, each with a random 'ii' key, and then issues another 100,000,000 queries, each with a random 'ii' value between 0 and 100,000. This means that some queries should not find any documents; exactly how many depends on the implementation of Random.next().
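For example, a quick standalone simulation (assuming both the inserted and the queried 'ii' values are drawn uniformly from [0, 100000), which may not match the script exactly) suggests roughly (1 - 1/100000)^100000, about 37%, of the queries would come back empty:

{code}
// Standalone estimate of the miss rate of the described key distribution;
// independent of MongoDB and of the attached C# script.
#include <cstdio>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng(12345);
    std::uniform_int_distribution<int> key(0, 99999);

    std::vector<bool> present(100000, false);
    for (int i = 0; i < 100000; ++i) present[key(rng)] = true;  // inserted 'ii' values

    long long misses = 0;
    const long long queries = 10'000'000;   // scaled down from 100,000,000 for speed
    for (long long i = 0; i < queries; ++i)
        if (!present[key(rng)]) ++misses;

    std::printf("miss rate: %.1f%%\n", 100.0 * misses / queries);
}
{code}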

Does the attached script produce the reported behavior for you?

Comment by Eliot Horowitz (Inactive) [ 05/Jun/12 ]

That doesn't totally make sense to me.
Can you send the code of the test you are running?
