[SERVER-12415] spikes in write lock % only on one out of 5 shards Created: 21/Jan/14 Updated: 16/Feb/15 Resolved: 16/Feb/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Concurrency |
| Affects Version/s: | 2.4.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Nemanja Dubravac | Assignee: | Stennie Steneker (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CentOS 6.4, sharded cluster with 5 shards |
||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Since January 14th we have been seeing spikes in write lock % tracked in MMS, but only on one shard out of 5 (our _c shard). We also see spikes in the read queue at the same time write lock % spikes. However, we don't see corresponding spikes in IO, CPU load, or network traffic on those servers, so we don't understand the exact cause of those spikes. |
| Comments |
| Comment by Stennie Steneker (Inactive) [ 19/Mar/14 ] |
|
Hi Nemanja, Thank you for following up with a summary of the root cause of your lock percentage spikes. I will close out this issue. Large embedded arrays with frequent updates can definitely lead to performance problems. A related blog write-up from one of our solution architects includes some considerations: "Why shouldn't I embed large arrays in my documents?". Regards, |
| Comment by Nemanja Dubravac [ 19/Mar/14 ] |
|
Hi Stephen, I'm sorry for not following up on this issue right after we figured out what was causing it. Thanks for reminding me of it; hopefully anyone who runs into the same problem will be able to find the following answer: One of our user objects had a key whose value was a huge array (> 1K elements), and that key was indexed. It's an amateur mistake to index arrays whose size is not bounded, but there you have it. Whenever this user was online and active, or some other operation triggered an insert into this growing array, it caused a spike in lock % on the shard where this user object was located. I really hope this will help someone else, because we wasted a lot of hours checking whether sharding was set up correctly, whether balancing was working, whether there were hardware issues, and so on. Anyway, thank you Stephen for checking this issue and trying to help. |
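The mechanics behind this are worth spelling out. On the 2.4-era MMAPv1 storage engine, each element of an indexed array gets its own entry in the (multikey) index, and a document that outgrows its allocated record is moved on disk, at which point all of those index entries must be rewritten while the write lock is held. A minimal sketch of that cost model follows; the function and numbers are illustrative, not the server's actual implementation:

```python
# Rough model of MMAPv1-era update cost for a document with an indexed
# array (multikey index). Illustrative only, not the server's code.

def index_entries_touched(array_len, doc_moved):
    """Index entries written for one push into the indexed array.

    - One new entry is always added for the pushed element.
    - If the grown document no longer fits its record and moves on
      disk, every existing entry must be rewritten to point at the
      new location.
    """
    return 1 + (array_len if doc_moved else 0)

# A small array that still fits in place: cheap.
print(index_entries_touched(10, doc_moved=False))   # 1

# A 1,500-element array whose document moves: every entry rewritten.
print(index_entries_touched(1500, doc_moved=True))  # 1501
```

This is why the spikes tracked the activity of a single user: a hot document with a large indexed array turns each small update into work proportional to the array's size, all on one shard.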
| Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ] |
|
Hi Nemanja, Aside from chunk migrations, other possibilities for increased write lock % on a specific shard include frequent updates to a particular hot document, or heavy writes to unsharded collections whose primary shard is that shard.
If you would be willing to upload a dump of your config database for review, we could have a look at the migration history and primary shards for unsharded collections. Please confirm if you'd like to do so – we can provide a private SCP dropbox so the information is not public on this issue. Thanks, |
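For anyone wanting to do a similar review themselves, chunk placement lives in the config database (the config.chunks collection). A rough sketch of spotting an imbalanced shard from such a dump; the document shapes are simplified and the shard names are hypothetical:

```python
from collections import Counter

# Hypothetical excerpt of config.chunks documents (shapes simplified);
# a real review would read these from a dump of the config database.
chunks = [
    {"ns": "app.users", "shard": "shard_a"},
    {"ns": "app.users", "shard": "shard_b"},
    {"ns": "app.users", "shard": "shard_c"},
    {"ns": "app.users", "shard": "shard_c"},
    {"ns": "app.events", "shard": "shard_c"},
]

# Count chunks per shard; a skewed count suggests uneven data or
# balancing problems worth investigating further.
per_shard = Counter(c["shard"] for c in chunks)
for shard, n in sorted(per_shard.items()):
    print(shard, n)
```

In this toy dump, shard_c holds 3 of the 5 chunks. Note that an even chunk distribution (as in this ticket) does not rule out a hot document, since all writes to one document land on whichever shard owns its chunk.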
| Comment by Nemanja Dubravac [ 22/Jan/14 ] |
|
There are no chunk migrations at the time of those spikes, as balancing is scheduled for night time. Here's a concrete example of such spikes. Under normal behavior (even with a lot of users) our total backend response time is below 250ms and write lock % is below 10%. However, at random times throughout the day we see spikes where write lock % reaches 60% or more for a short period, and our total response time then jumps to several seconds, as you can see in the attached screenshots. |
| Comment by Daniel Pasette (Inactive) [ 21/Jan/14 ] |
|
Are there chunk migrations to or from that shard? You should be able to gauge activity by looking at the logs for that shard, or by running mongotop to see which collections are active. |
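One thing such a log check might surface (relevant given the root cause later in this thread) is updates that had to move a document on disk. A small sketch of scanning a 2.4-style slow-op line for that marker; the sample line and the nmoved field name are assumed from that era's log format, not taken from this cluster's logs:

```python
import re

# Illustrative 2.4-style slow-op log line (not from this cluster).
line = ("Tue Jan 21 10:00:00 [conn42] update app.users "
        "query: { _id: 1 } nscanned:1 nmoved:1 nupdated:1 850ms")

# Updates that moved the document on disk are flagged with nmoved.
match = re.search(r"nmoved:(\d+)", line)
if match:
    print("document move detected:", match.group(0))
```

Grepping the shard's mongod log for such markers around the spike timestamps, alongside mongotop output, would narrow the spikes down to specific collections and operations.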