[SERVER-4649] mongod primary locks up and never kills cursors stalling connections for ever Created: 09/Jan/12  Updated: 15/Aug/12  Resolved: 22/Mar/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.1
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Johnny Boy Assignee: Kristina Chodorow (Inactive)
Resolution: Incomplete Votes: 0
Labels: pv1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian Squeeze with official 10gen deb packages.
Two servers running on a replicaset.

Each server is also running numactl --interleave=all wrapper.

Mixed clients but mainly used by the PHP driver.


Attachments: Text File currentop.log     GIF File mms.gif     GIF File mms2.gif     Text File oplog - start to crash.log     Text File oplog.log     Text File pinvalue.txt    
Operating System: ALL
Participants:

 Description   

At what seem to be random times, the mongod primary never finishes some queries.
Some cursors have been running for 6+ hours. This has only happened on the primary.

The log always has these entries when this occurs:

[conn123812] warning: virtual size (129449MB) - mapped size (123342MB) is large. could indicate a memory leak

[initandlisten] connection accepted from 172.16.49.98:40540 #123818
[conn123816] end connection 172.16.49.98:40539
[conn123817] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
[conn123817] killcursors exception: assertion db/clientcursor.h:309 1ms

And as we can see in the log, even the queries that do complete take forever:
[conn123170] query x.scores nscanned:1015 scanAndOrder:1 nreturned:306 reslen:12260 302422ms

The server resources don't seem to be exhausted, and restarting the server gets everything back to normal again.



 Comments   
Comment by Kristina Chodorow (Inactive) [ 14/Jun/12 ]

@Andreas: for debugging what's slowing MongoDB down, the mailing list is a better format (https://groups.google.com/group/mongodb-user/). If this problem continues post-ETA2, please ask about it there.

Comment by Andreas Bernhardsson [ 13/Jun/12 ]

This is still very much a big problem; the assertion, as you say, is probably harmless and just a symptom of the underlying issue.

I don't know why, and the only info in the oplog is the assertion that gets triggered when the DB stops responding (or becomes super slow).

This happens during normal operation: the DB is performing extremely well and has no speed issues. Then all of a sudden this issue appears, the cursors and queues spike, and the site stops loading. This happens about every other week for me at the moment.

ETA: And no, no indexes or anything are being built when this happens. It's just during normal operation.

ETA2: I looked into the index situation and the activity seems to have been caused by "ensure index" operations; I've removed them and the index activity disappeared. I don't know if that was what caused the other issues, but time will tell.
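
For reference, one way to avoid per-request "ensure index" calls is to create the indexes once from the mongo shell, e.g. at deploy time. A minimal sketch follows (the collection name and key pattern are hypothetical; the database name is taken from the system.indexes entries mentioned in the 07/Mar/12 comment):

use OviPets
db.scores.getIndexes()                          // check what already exists
db.scores.ensureIndex({ user: 1, score: -1 },   // hypothetical key pattern
                      { background: true })     // background build avoids blocking the primary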

Comment by Kristina Chodorow (Inactive) [ 14/May/12 ]

This particular assertion is harmless and is a side-effect of a driver bug. The driver bug is triggered by DB slowness, which is the real issue you should try to debug if you see this assertion. The mailing list can help you debug general DB slowness; Jira isn't really the place for it.

Comment by ttt [ 14/May/12 ]

any resolution on this?

Comment by Kristina Chodorow (Inactive) [ 07/Mar/12 ]

There are a bunch of inserts into OviPets.system.indexes; were you building an index? What was writing there? Do you have iostats from the spikes?
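
A quick way to check from the mongo shell whether a foreground index build is in flight during one of the spikes (a sketch; on 2.0 a foreground build normally shows up in currentOp as an insert into <db>.system.indexes):

db.currentOp().inprog.forEach(function (op) {
    if (/system\.indexes/.test(op.ns || "")) printjson(op);   // print any in-progress index builds
});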

Comment by Andreas Bernhardsson [ 20/Feb/12 ]

Had a really bad crash day last night; the issue occurred three times at 10-minute intervals. I've uploaded an oplog covering the period from the first restart to the next crash 10 minutes later, and an MMS screenshot.

Comment by Andreas Bernhardsson [ 18/Feb/12 ]

I added a screenshot from MMS when the issue occurs.

Comment by Andreas Bernhardsson [ 15/Feb/12 ]

My oplog covers from 20:00:00, when the first _pinValue assertion appeared, until about 10 minutes later when I restarted the DB.

I've extracted the 32 assertions into pinvalue.txt.

This MongoDB server is run on a dedicated server with 24GB of RAM and RAID 10 running Ubuntu 11.04. I'm using 2.0.2 and the latest PHP driver (1.2.7).

This issue happened once a day over the last weekend; we had a bit more load than usual, which I'm guessing triggered it.

Comment by Eliot Horowitz (Inactive) [ 15/Feb/12 ]

The pin value error is a bit different and doesn't itself cause issues.

Can you add your logs, etc.?

Comment by Andreas Bernhardsson [ 13/Feb/12 ]

I commented on SERVER-4617 but this seems to be the exact same issue?

Is there any workaround to avoid this issue until this bug is resolved?

Comment by Johnny Boy [ 11/Jan/12 ]

Sorry to hear that. Thank you for looking into this and I hope we can solve it.

Last night, the pinValue assert occurred again, and the number of connections started rising. It never got to the memory warning, but it does seem similar to how it behaved before. Unfortunately I didn't notice in time and wasn't able to get a currentOp list.

The error:

Wed Jan 11 05:14:11 [initandlisten] connection accepted from 172.16.49.121:36532 #82981
Wed Jan 11 05:14:11 [conn82981] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7f2e115ea8ba 0x7f2e10ba602d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
/usr/bin/mongod() [0x7982e9]
/usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
/lib/libpthread.so.0(+0x68ba) [0x7f2e115ea8ba]
/lib/libc.so.6(clone+0x6d) [0x7f2e10ba602d]
Wed Jan 11 05:14:11 [conn82981] killcursors exception: assertion db/clientcursor.h:309 1ms

Wed Jan 11 05:29:37 [initandlisten] connection accepted from 172.16.49.121:39143 #84017
Wed Jan 11 05:29:37 [conn84016] end connection 172.16.49.121:39142
Wed Jan 11 05:29:37 [conn84017] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7f2e115ea8ba 0x7f2e10ba602d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
/usr/bin/mongod() [0x7982e9]
/usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
/lib/libpthread.so.0(+0x68ba) [0x7f2e115ea8ba]
/lib/libc.so.6(clone+0x6d) [0x7f2e10ba602d]
Wed Jan 11 05:29:37 [conn84017] killcursors exception: assertion db/clientcursor.h:309 2ms

Also, there seem to be a lot of idle cursors; is this normal?
PRIMARY> db.currentOp().inprog.length
6
PRIMARY> db.currentOp(true).inprog.length
605
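
(For context: db.currentOp() lists only active operations, while db.currentOp(true) also includes idle connections and system threads, which is why the two counts differ so much. A sketch for counting just the operations that have been active for a while, assuming the secs_running field is present in this version's currentOp output:)

var slow = 0;
db.currentOp(true).inprog.forEach(function (op) {
    if (op.active && op.secs_running > 60) slow++;   // active for more than a minute
});
slow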

Comment by Scott Hernandez (Inactive) [ 10/Jan/12 ]

I've looked at the Dropbox Munin graphs and MMS, and unfortunately the logs/graphs don't all line up to provide a good view of what could be the cause.

If this happens again, now that MMS is enabled, we will be able to get a better picture of it. Please let us know when it does.

Comment by Johnny Boy [ 10/Jan/12 ]

The driver is PHP (5.3.8-1~dotdeb.2) and mongo 1.2.6.

Yes that is the MMS group.

The log timezone is UTC/GMT+1 and the Munin graphs are in UTC.

Comment by Scott Hernandez (Inactive) [ 10/Jan/12 ]

Please verify that you are using the PHP driver 1.2.7 and this is your group in MMS: https://mms.10gen.com/host/list/4f0bf92f47103e8b01b39d26

Also, what time zone are those logs in?

Comment by Eliot Horowitz (Inactive) [ 10/Jan/12 ]

What driver are you using?

Comment by Johnny Boy [ 10/Jan/12 ]

MMS is also hooked up now.

I think most of the information is already available in our Munin graphs.
Here is the gallery of the daily graphs that I think may be interesting: https://www.dropbox.com/gallery/2468164/1/Munin?h=08adec

You can easily see the spike in active connections when the error occurred, which shows at what time it happened.

Comment by Johnny Boy [ 10/Jan/12 ]

I will join MMS. I also have a currentOp log from that time, but not with the "true" argument; I hope it still helps a bit.

I will add a new log when the error occurs again.

I also upgraded to 2.0.2, which still has the same problem.

This time I didn't get the virtual size - mapped size warning, but instead got:

Mon Jan 9 19:18:02 [clientcursormon] warning virtual/mapped memory differential is large. journaling:1

– snip snip –

Mon Jan 9 19:18:03 [conn21771] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
Mon Jan 9 19:18:03 [conn21770] end connection 172.16.49.121:45331
0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7f4d6f9968ba 0x7f4d6ef5202d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
/usr/bin/mongod() [0x7982e9]
/usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
/lib/libpthread.so.0(+0x68ba) [0x7f4d6f9968ba]
/lib/libc.so.6(clone+0x6d) [0x7f4d6ef5202d]
Mon Jan 9 19:18:03 [conn21771] killcursors exception: assertion db/clientcursor.h:309 1ms
Mon Jan 9 19:18:03 [conn21771] end connection 172.16.49.121:45332

Comment by Scott Hernandez (Inactive) [ 09/Jan/12 ]

Can you run db.currentOp(true) while one of these long-running queries is running?

Are you using MMS, and if so, can you tell us when this happened so we can look at your server stats?
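
One way to capture that output for attaching to the ticket (a sketch; the host name is a placeholder):

// from an interactive shell session on the primary, while a slow query is running:
printjson(db.currentOp(true))
// or non-interactively, redirecting to a file:
//   mongo --host <primary-host> --eval "printjson(db.currentOp(true))" > currentop.log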
