[SERVER-4649] mongod primary locks up and never kills cursors, stalling connections forever Created: 09/Jan/12 Updated: 15/Aug/12 Resolved: 22/Mar/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.0.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Johnny Boy | Assignee: | Kristina Chodorow (Inactive) |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | pv1 |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: |
Debian Squeeze with the official 10gen deb packages. Each server also runs under a numactl --interleave=all wrapper. Mixed clients, but mainly the PHP driver. |
| Operating System: | ALL |
| Description |
|
At what seem to be random times, the mongod primary never finishes some queries. The log always has these entries when this occurs:

[conn123812] warning: virtual size (129449MB) - mapped size (123342MB) is large. could indicate a memory leak
[initandlisten] connection accepted from 172.16.49.98:40540 #123818

As we can see in the log, even the queries that do work seem to take forever. The server's resources don't seem to be exhausted, and restarting the server gets everything back to normal again. |
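The differential that warning reports can be read straight from the mongo shell; a minimal sketch using serverStatus (the roughly-2x-under-journaling expectation is general MongoDB background, not something established in this ticket):

    // Compare mongod's virtual and mapped memory, as the warning does.
    var mem = db.serverStatus().mem;
    var diffMB = mem.virtual - mem.mapped;
    print("virtual: " + mem.virtual + "MB, mapped: " + mem.mapped +
          "MB, differential: " + diffMB + "MB");
    // With journaling on, virtual is expected to run near 2x mapped;
    // a differential far beyond that can point at leaked cursors or
    // connections rather than working-set growth.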
| Comments |
| Comment by Kristina Chodorow (Inactive) [ 14/Jun/12 ] |
|
@Andreas: for debugging what's slowing MongoDB down, the mailing list is a better format (https://groups.google.com/group/mongodb-user/). If this problem continues post-ETA2, please ask about it there. |
| Comment by Andreas Bernhardsson [ 13/Jun/12 ] |
|
This is still very much a big problem. The assertion, as you say, is probably harmless and just a symptom of the underlying issue. I don't know why it happens, and the only info in the oplog is the assertion that gets triggered when the DB stops responding (or becomes super slow). This happens during normal operation: the DB is performing extremely well, with no speed issues, and then all of a sudden this issue appears, the cursors and queues spike, and the site stops loading. This happens about every other week for me at the moment.

ETA: And no, no indexes or anything are being built when this happens. It's just during normal operation.

ETA2: I looked into the index situation and it seems to have been caused by "ensure index" operations. I've removed them and the index activity disappeared. I don't know if that was what caused the other issues, but time will tell. |
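For context on ETA2: in the 2.0 era, an "ensure index" call was implemented as an insert into <db>.system.indexes, so issuing it on every request shows up as a steady write stream even when the index already exists. A minimal shell illustration (the pets collection and owner_id field are made up for the example):

    // What the server sees for each ensureIndex call: an insert into
    // the special system.indexes collection. If the index already
    // exists the insert is discarded, but it still has to be processed.
    db.pets.ensureIndex({ owner_id: 1 });
    // Check what indexes exist, so the calls can be moved to a
    // one-time deploy step instead of per-request application code:
    db.pets.getIndexes();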
| Comment by Kristina Chodorow (Inactive) [ 14/May/12 ] |
|
This particular assertion is harmless and is a side-effect of a driver bug. The driver bug is triggered by DB slowness, which is the real issue you should try to debug if you get this assertion. The mailing list can help you debug general DB slowness; Jira isn't really the place for it. |
| Comment by ttt [ 14/May/12 ] |
|
any resolution on this? |
| Comment by Kristina Chodorow (Inactive) [ 07/Mar/12 ] |
|
There are a bunch of inserts into OviPets.system.indexes. Were you building an index? What was writing there? Do you have iostat output from the spikes? |
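One way to catch such writes in the act, assuming they are still happening, is to filter currentOp output for the system.indexes namespace (a sketch; field names per the 2.0-era currentOp format):

    // Print in-progress operations touching system.indexes
    // (index builds and the inserts that trigger them).
    var ops = db.currentOp(true).inprog;
    for (var i = 0; i < ops.length; i++) {
        if (ops[i].ns && ops[i].ns.indexOf("system.indexes") !== -1) {
            printjson(ops[i]);
        }
    }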
| Comment by Andreas Bernhardsson [ 20/Feb/12 ] |
|
Had a really bad crash day last night; the issue occurred three times at 10-minute intervals. I've uploaded an oplog covering the window from the first restart to the next crash 10 minutes later, plus an MMS screenshot. |
| Comment by Andreas Bernhardsson [ 18/Feb/12 ] |
|
I added a screenshot from MMS showing when the issue occurs. |
| Comment by Andreas Bernhardsson [ 15/Feb/12 ] |
|
My OpLog runs from 20:00:00, when the first _pinValue assertion appeared, until about 10 minutes later when I restarted the DB. I've extracted the 32 assertions into pinvalue.txt. This MongoDB instance runs on a dedicated server with 24GB of RAM and RAID 10, running Ubuntu 11.04. I'm using 2.0.2 and the latest PHP driver (1.2.7). This issue happened once a day over the last weekend; we had a bit more load than usual, which I'm guessing triggered it. |
| Comment by Eliot Horowitz (Inactive) [ 15/Feb/12 ] |
|
The pin value error is a bit different, and by itself doesn't cause issues. Can you add your logs, etc.? |
| Comment by Andreas Bernhardsson [ 13/Feb/12 ] |
|
I commented on

Is there any workaround to avoid this issue until this bug is resolved? |
| Comment by Johnny Boy [ 11/Jan/12 ] |
|
Sorry to hear that. Thank you for looking into this; I hope we can solve it. Last night the pinValue assert occurred again, and the number of connections started rising. It never got to the memory warning, but it does seem similar to how it behaved before. Unfortunately I didn't notice in time and wasn't able to get a currentOp list. The error:

Wed Jan 11 05:14:11 [initandlisten] connection accepted from 172.16.49.121:36532 #82981
Wed Jan 11 05:29:37 [initandlisten] connection accepted from 172.16.49.121:39143 #84017

Also, there seem to be a lot of idle cursors. Is this normal? |
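On the idle-cursor question: the shell exposes cursor counters that make it easy to quantify (a minimal sketch; the cursors section of serverStatus is present on 2.0-era servers):

    // Open-cursor totals, including cursors that have timed out.
    var cursors = db.serverStatus().cursors;
    printjson(cursors);
    // e.g. { totalOpen: ..., clientCursors_size: ..., timedOut: ... }
    // A totalOpen that climbs without bound while connections stall
    // would fit the "never kills cursors" symptom in this ticket.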
| Comment by Scott Hernandez (Inactive) [ 10/Jan/12 ] |
|
I've looked at the dropbox munin graphs and MMS and unfortunately the logs/graphs don't all line up to provide a good view of what could be the cause. If this happens again, now that MMS is enabled, we will be able to see a better picture of it. Please let us know when this happens again. |
| Comment by Johnny Boy [ 10/Jan/12 ] |
|
The driver is PHP (5.3.8-1~dotdeb.2) with mongo extension 1.2.6. Yes, that is the MMS group. The log timezone is UTC/GMT+1 and the Munin graphs are in UTC. |
| Comment by Scott Hernandez (Inactive) [ 10/Jan/12 ] |
|
Please verify that you are using PHP driver 1.2.7 and that this is your group in MMS: https://mms.10gen.com/host/list/4f0bf92f47103e8b01b39d26 Also, what time zone are those logs in? |
| Comment by Eliot Horowitz (Inactive) [ 10/Jan/12 ] |
|
What driver are you using? |
| Comment by Johnny Boy [ 10/Jan/12 ] |
|
MMS is also hooked up now. I think most of the information is already available in our Munin graphs; the spike in active connections makes it easy to see at what time the error occurred. |
| Comment by Johnny Boy [ 10/Jan/12 ] |
|
I will join MMS. I also have a currentOp log from that time, but without the "true" argument; I hope it still helps a bit. I will add a new log when the error occurs again. I also upgraded to 2.0.2, which still has the same problem. This time I didn't get the virtual size - mapped size warning, but instead:

Mon Jan 9 19:18:02 [clientcursormon] warning virtual/mapped memory differential is large. journaling:1
– snip snip –
Mon Jan 9 19:18:03 [conn21771] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309 |
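On the missing "true" argument: without it, currentOp omits idle connections and system operations, so a plain call can look empty even while cursors pile up. A minimal sketch of the difference:

    // db.currentOp() lists only active operations; db.currentOp(true)
    // also includes idle connections and system/internal operations.
    var active = db.currentOp().inprog;
    var all = db.currentOp(true).inprog;
    print("active ops: " + active.length + ", all (incl. idle): " + all.length);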
| Comment by Scott Hernandez (Inactive) [ 09/Jan/12 ] |
|
Can you run db.currentOp(true) when one of these long-running queries is running? Are you using MMS, and if so, can you tell us when this happened so we can look at your server stats? |
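A sketch of what to capture when a stall is in progress (the 10-second threshold is illustrative, not a recommendation from this ticket):

    // Run in the mongo shell during a stall and attach the output.
    var ops = db.currentOp(true).inprog;
    for (var i = 0; i < ops.length; i++) {
        // secs_running is how long the operation has been running.
        if (ops[i].secs_running && ops[i].secs_running > 10) {
            printjson(ops[i]);
        }
    }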