[SERVER-12415] spikes in write lock % only on one out of 5 shards Created: 21/Jan/14  Updated: 16/Feb/15  Resolved: 16/Feb/15

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nemanja Dubravac Assignee: Stennie Steneker (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.4, sharded cluster with 5 shards


Attachments: PNG File backend_response_time.png     PNG File shard_c_write_lock_percentage.png    
Operating System: ALL
Participants:

 Description   

Since January 14th we have been seeing spikes in write lock % tracked in MMS, but only on one shard out of 5 (our _c shard). We also see spikes in the read queue at the same time the write lock % spikes.
A few non-sharded collections have this shard as their primary, but they don't receive many writes.

However, we don't see spikes in IO / CPU load or network traffic on those servers, so we don't understand the exact cause of those spikes.
Can anybody explain what could be the bottleneck other than disk IO, CPU, or network bandwidth?



 Comments   
Comment by Stennie Steneker (Inactive) [ 19/Mar/14 ]

Hi Nemanja,

Thank you for following up with a summary of the root cause of your lock percentage spikes. I will close out this issue.

Large embedded arrays with frequent updates can definitely lead to performance problems. A related blog write-up from one of our solution architects covers some of the considerations: "Why shouldn't I embed large arrays in my documents?".

Regards,
Stephen

Comment by Nemanja Dubravac [ 19/Mar/14 ]

Hi Stephen,

I'm sorry for not following up on this issue right after we figured out what was causing it. Thanks for reminding me of it; hopefully, if someone else runs into the same problem, they will be able to find the following answer:

One of our user objects had a key whose value was a huge array (> 1K elements), and that key was indexed. It's an amateur mistake to index arrays whose size is not bounded, but there you have it. Whenever this user was online and active, or some other operation triggered an insert into this growing array, it caused a spike in lock % on the shard where this user object was located.
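For anyone hitting the same symptom, here is a minimal sketch of the pattern described above (the collection and field names are hypothetical, and a pymongo client is assumed). Each $push into the large, indexed array is the write that drives the lock spike, because the multikey index has to be maintained and the growing document may have to be moved.

{code:python}
# Sketch of the anti-pattern: a multikey index on an unbounded,
# frequently-updated array (hypothetical collection/field names).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
users = client["app"]["users"]

# Indexing an array field creates a multikey index: one entry per element.
users.create_index("activity_log")

# Every push into a >1K-element indexed array forces index maintenance, and
# document growth can relocate the record so all of its index entries must be
# rewritten; on MongoDB 2.4 this work holds the per-database write lock, so
# other reads and writes on that shard queue up behind it.
users.update_one(
    {"_id": "heavy_user"},
    {"$push": {"activity_log": {"ts": "2014-01-14T12:00:00Z", "event": "login"}}},
)
{code}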

I really hope this will help someone else, because we wasted a lot of hours checking whether sharding was set up correctly, whether balancing was working, whether there were hardware issues... and so on.

Anyway, thank you Stephen for checking this issue and trying to help.

Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ]

Hi Nemanja,

Aside from chunk migrations, some other possibilities for increased write lock % on a specific shard include (see the sketch after this list for one way to check both from the config database):

  • write activity for unsharded collections (which live on the primary shard of their database)
  • a poor shard key - for example, a monotonically increasing value (such as ObjectId) can create a "hot shard"
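
A rough sketch (assuming a pymongo connection to one of your mongos instances; the hostname is hypothetical) of how to check both points from the config database:

{code:python}
# Sketch: inspect primary-shard placement and chunk distribution via a mongos.
from collections import Counter
from pymongo import MongoClient

mongos = MongoClient("mongodb://your-mongos:27017")  # hypothetical host
config = mongos["config"]

# Which shard is the primary for each database; unsharded collections in that
# database receive all of their writes on this shard.
for db in config["databases"].find({}, {"_id": 1, "primary": 1}):
    print(db["_id"], "->", db.get("primary"))

# Chunk counts per shard for each sharded collection; a heavy skew toward one
# shard can indicate a poorly distributing shard key or a balancer that is
# falling behind.
chunk_counts = Counter(
    (c["ns"], c["shard"]) for c in config["chunks"].find({}, {"ns": 1, "shard": 1})
)
for (ns, shard), count in sorted(chunk_counts.items()):
    print(ns, shard, count)
{code}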

If you would be willing to upload a dump of your config database for review, we could have a look at the migration history and the primary shards for unsharded collections. Please confirm if you'd like to do so; we can provide a private SCP dropbox so the information is not public on this issue.

Thanks,
Stephen

Comment by Nemanja Dubravac [ 22/Jan/14 ]

There are no chunk migrations at the time of those spikes, as balancing is scheduled for night time.
I can see which collections are active, but I don't understand what is causing the write lock to jump to 60% and several hundred reads to queue up when no hardware bottlenecks are visible.
Can you tell me what else can affect the write lock % other than the mentioned hardware resources?

Here's a concrete example of such a spike. Under normal behavior (even with a lot of users) our total backend response time is below 250ms and write lock % is below 10%. However, at random times throughout the day we see spikes in write lock % that reach 60% or more for a short period, and our total response time then jumps to several seconds, as you can see in the attached screenshots.

Comment by Daniel Pasette (Inactive) [ 21/Jan/14 ]

Are there chunk migrations to or from that shard? You should be able to gauge activity by looking at the logs for that shard, or by running mongotop to see which collections are active.
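
If it helps, here is a small sketch (the shard hostname is hypothetical) that samples the same per-collection counters mongotop reports, by running the top command directly against that shard's mongod:

{code:python}
# Sketch: sample per-collection write lock time on one shard's mongod using
# the "top" admin command (the counters mongotop reports from).
import time
from pymongo import MongoClient

shard = MongoClient("mongodb://shard-c-primary:27017")  # hypothetical host

def snapshot():
    # totals maps namespace -> counters (total/readLock/writeLock/...),
    # each with a cumulative "time" (microseconds) and "count".
    return shard.admin.command("top")["totals"]

before = snapshot()
time.sleep(5)
after = snapshot()

for ns, counters in after.items():
    if not isinstance(counters, dict) or ns not in before:
        continue  # skips the "note" field and namespaces created mid-sample
    delta_us = counters["writeLock"]["time"] - before[ns]["writeLock"]["time"]
    if delta_us > 0:
        print(ns, delta_us, "us of write lock time in the last 5s")
{code}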
