[SERVER-7647] getLastError seems to hang after some queries Created: 13/Nov/12 Updated: 02/Dec/16 Resolved: 10/Apr/13
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Spiros Ioannou | Assignee: | David Hows |
| Resolution: | Incomplete | Votes: | 2 |
| Labels: | replica, sharding |
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Debian, with newest Java driver |
| Attachments: | |
| Operating System: | Linux |
| Participants: |
| Description |
3 shards on 3 physical servers, 3 replica sets. Each replica set has 2 members plus one arbiter. I have 50 threads, each inserting a document 1000 times, 50K documents in total.
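For illustration only, a minimal sketch of this kind of workload using the legacy Java driver (the actual client code is not attached to the ticket; the host, database, and collection names are placeholders, and the w:2 write concern is taken from the comments below):

```java
// Hedged sketch, not the reporter's actual code: 50 threads x 1000 inserts
// through a mongos, using w:2 (REPLICAS_SAFE) writes. Host, database and
// collection names are placeholders.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.WriteConcern;

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class InsertLoad {
    public static void main(String[] args) throws Exception {
        final MongoClient mongo = new MongoClient("localhost", 27017); // a mongos router
        final DBCollection coll = mongo.getDB("test").getCollection("docs");
        // REPLICAS_SAFE (w:2): every insert waits until a secondary has the write.
        coll.setWriteConcern(WriteConcern.REPLICAS_SAFE);

        ExecutorService pool = Executors.newFixedThreadPool(50);
        for (int t = 0; t < 50; t++) {
            final int threadId = t;
            pool.submit(new Runnable() {
                public void run() {
                    for (int i = 0; i < 1000; i++) {
                        coll.insert(new BasicDBObject("thread", threadId).append("seq", i));
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        mongo.close();
    }
}
```

With w:2, each insert only returns once a secondary has acknowledged the write, which is where a getLastError hang would surface if replication falls behind.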
| Comments |
| Comment by Stennie Steneker (Inactive) [ 10/Apr/13 ] |

Hi Daniel,

Since you do not have any further information for us to investigate this issue (logs, queries, or MMS), we'll resolve it as Incomplete.

Thanks,
| Comment by Daniel Gibbons [ 30/Jan/13 ] |

The nodes are not in MMS and I do not have the queries or logs anymore. If it happens again, I will pull the logs and whatever other info I can from the servers. Another fact I remembered that may be helpful is that the primary had around 170 connections at the time.
| Comment by David Hows [ 30/Jan/13 ] |

Hi Daniel,

Are your nodes in MMS? If so, can you provide your MMS URL?

Can you outline which queries were running? Are they visible in the logs? If so, can you paste the log output?

Thanks, David
| Comment by Daniel Gibbons [ 29/Jan/13 ] |

I recently experienced this. The symptoms were as follows:
| Comment by Spiros Ioannou [ 20/Nov/12 ] |

vserver-2:192.168.1.77
| Comment by Spiros Ioannou [ 20/Nov/12 ] |

I have all the logs, but unfortunately I did not note which tests showed the timeout. I examined the logs and found nothing unusual.
| Comment by Eliot Horowitz (Inactive) [ 20/Nov/12 ] |

From the original issue, the reason things paused was replication falling behind on the w:2 writes.
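If that is the cause, a wtimeout on the write concern turns an indefinite getLastError wait into a visible error. A minimal sketch, assuming the legacy Java driver; the 5000 ms value and connection details are illustrative, not from this ticket:

```java
// Minimal sketch: bound the w:2 wait with a wtimeout so a lagging secondary
// produces an error instead of an indefinite getLastError wait.
// The 5000 ms timeout and connection details are assumptions.
import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;
import com.mongodb.MongoException;
import com.mongodb.WriteConcern;

public class BoundedSafeWrite {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017);
        DBCollection coll = mongo.getDB("test").getCollection("docs");
        // w:2 with a 5-second wtimeout instead of waiting forever for replication.
        coll.setWriteConcern(new WriteConcern(2, 5000));
        try {
            coll.insert(new BasicDBObject("x", 1));
        } catch (MongoException e) {
            // Depending on driver version this surfaces as a write concern error;
            // either way the calling thread is released after roughly 5 seconds.
            System.err.println("w:2 not satisfied in time: " + e.getMessage());
        } finally {
            mongo.close();
        }
    }
}
```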
| Comment by David Hows [ 20/Nov/12 ] |

Hi Spiros,

I'm sorry to hear this, but I understand where you are coming from. MongoDB relies on the operating system to control system resource allocation. When a system has potentially high resource contention, performance can indeed be a little unpredictable, as there is never a guarantee of how much of your working data set will be in RAM at any given time.

If you were to run each of the mongod instances in its own VM with a dedicated slice of resources, you should see an increase in performance. You may also want to look at a paravirtualized driver, as this will give your DomUs direct access to the disks and should remove the overhead of I/O passing through the dom0.

If there is anything more we can do for you, please let me know.

Cheers, David
| Comment by Spiros Ioannou [ 19/Nov/12 ] |

Hi David,
| Comment by David Hows [ 19/Nov/12 ] |

Hi Spiros,

Re: background flush - yes, this is the background thread that persists data to disk, and it is controlled by syncdelay. Given that it doesn't spike above a second in the event I will discuss below, it is an unlikely cause (for that event at least), but worth keeping an eye on in general.

To get a better understanding of your environment, can I ask: why are you running your mongod instances as root within what is basically the hypervisor (dom0)? Are there any operations that could be contending with MongoDB for resources in this domain (something on the DomUs, for example)?

Looking at the rs1 set as an example, the only event we have to go on in the MMS graphs is a page fault event on the 16th at around 12:00 GMT. That event looks to have been an insert, and the lock percentage on the ifms collection spikes to ~100%, with a lesser lock spike on the local DB (where the oplog resides) - the oplog is essentially what the secondaries must read from when replicating. Around the time of that event, the resident memory on both members halves (down to ~1.23GB from 2.49GB on the primary). That suggests a couple of things:

1. The pages that were brought into memory at that time were not all touched by mongod (this can be caused by readahead).

Some other questions related to this: are you running multiple instances of mongod in the same dom0? Around the time that the other instances drop, vserver-dev-2:47018 raised its resident memory from 0.5G to 1.6G. This looks like competition between two different mongod instances for RAM.

A few suggestions:

Cheers, David
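For tracking the counters discussed above (resident memory and page faults) outside of MMS, one option is to poll serverStatus on each mongod. A sketch, with placeholder member addresses rather than the ticket's actual topology:

```java
// Sketch: read resident memory (MB) and cumulative page faults from serverStatus
// on each mongod, to correlate with the drops seen in the MMS graphs.
// Member addresses below are placeholders.
import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

public class MemCheck {
    public static void main(String[] args) throws Exception {
        String[] members = {"localhost:27018", "localhost:27019"}; // placeholder mongod addresses
        for (String member : members) {
            String[] hp = member.split(":");
            MongoClient mongo = new MongoClient(hp[0], Integer.parseInt(hp[1]));
            CommandResult status = mongo.getDB("admin").command("serverStatus");
            DBObject mem = (DBObject) status.get("mem");
            DBObject extra = (DBObject) status.get("extra_info"); // page_faults is Linux-only
            System.out.println(member
                    + " resident(MB)=" + mem.get("resident")
                    + " pageFaults=" + (extra != null ? extra.get("page_faults") : "n/a"));
            mongo.close();
        }
    }
}
```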
| Comment by Spiros Ioannou [ 16/Nov/12 ] |

vserver-dev-{2,3} are HP ProLiant DL360 G7: 12 cores (2 threads each), 45GB installed memory, 8GB available to dom0 (where mongod runs); drives are 2 x 10Krpm SAS in RAID0.
| Comment by Spiros Ioannou [ 16/Nov/12 ] |

Hi David,

I include info for shards 1 and 2. vserver-dev-2 and vserver-dev-3 are identical in hardware and software configuration. On vserver-dev-2 and vserver-dev-3, data and binaries are on the /dev/cciss/c0d0p1 partition.

Best Regards,
| Comment by David Hows [ 15/Nov/12 ] |

Hi Spiros,

I've had a look at your systems in MMS. There are a few standouts:

The page faults and lock percentage are both worrying and indications of poor performance. Page faulting traditionally occurs when your working data set exceeds the RAM available to your system. Can you give me some background on your systems' disk and RAM (size, etc.)? Would you also be able to provide me with the output of:

Cheers, David
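The exact diagnostic output requested above is not preserved in this export. As a rough stand-in for the working-set-versus-RAM question, a sketch that lists per-database data and index sizes (to compare against the 8GB available to dom0); the connection details are placeholders:

```java
// Sketch: list on-disk data and index sizes per database via dbStats, to compare
// with the RAM available to mongod (the ratio behind the page-fault remark above).
// Connection details are placeholders; this is not the output David asked for.
import com.mongodb.CommandResult;
import com.mongodb.MongoClient;

public class SizeVsRam {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27017); // placeholder host
        for (String dbName : mongo.getDatabaseNames()) {
            CommandResult stats = mongo.getDB(dbName).getStats(); // runs dbStats
            double dataMb = ((Number) stats.get("dataSize")).doubleValue() / 1048576.0;
            double indexMb = ((Number) stats.get("indexSize")).doubleValue() / 1048576.0;
            System.out.printf("%s data=%.1f MB indexes=%.1f MB%n", dbName, dataMb, indexMb);
        }
        mongo.close();
    }
}
```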
| Comment by Spiros Ioannou [ 15/Nov/12 ] |

Hi David,

This could be a duplicate of

thanks,
| Comment by David Hows [ 15/Nov/12 ] |

Hi Spiros,

I can see your group, but there is only the agent and no servers within your group. Have you added any hosts in the past?

Cheers, David
| Comment by Spiros Ioannou [ 14/Nov/12 ] |

The MMS group name is "inaccess" (if that's the string shown above the users), but the only errors are agent-specific.
| Comment by David Hows [ 13/Nov/12 ] |

Hi Spiros,

What you have described sounds like replication lag, so the safe writes are taking longer to confirm that they have been replicated. You should check the replication lag on the secondaries of your sets.

Do you have MMS? If so, what is your group name?

Cheers, David
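A sketch of that lag check with the legacy Java driver: run replSetGetStatus against a member of each replica set (not through mongos) and compare each secondary's optimeDate with the primary's. The address is a placeholder; rs.status() or db.printSlaveReplicationInfo() in the mongo shell show the same information:

```java
// Sketch: measure replication lag by comparing each secondary's optimeDate with
// the primary's, using replSetGetStatus against one member of the set.
// The address below is a placeholder, not the ticket's topology.
import com.mongodb.CommandResult;
import com.mongodb.DBObject;
import com.mongodb.MongoClient;

import java.util.Date;
import java.util.List;

public class ReplLag {
    public static void main(String[] args) throws Exception {
        MongoClient mongo = new MongoClient("localhost", 27018); // a replica set member
        CommandResult status = mongo.getDB("admin").command("replSetGetStatus");
        List<?> members = (List<?>) status.get("members");

        Date primaryOptime = null;
        for (Object m : members) {
            DBObject member = (DBObject) m;
            if ("PRIMARY".equals(member.get("stateStr"))) {
                primaryOptime = (Date) member.get("optimeDate");
            }
        }
        for (Object m : members) {
            DBObject member = (DBObject) m;
            if (primaryOptime != null && "SECONDARY".equals(member.get("stateStr"))) {
                long lagSeconds =
                        (primaryOptime.getTime() - ((Date) member.get("optimeDate")).getTime()) / 1000;
                System.out.println(member.get("name") + " lag: " + lagSeconds + "s");
            }
        }
        mongo.close();
    }
}
```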