[SERVER-4684] Severe server slowdown Created: 15/Jan/12  Updated: 15/Jan/12  Resolved: 15/Jan/12

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Aristarkh Zagorodnikov Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: pv1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: Linux
Participants:

 Description   

One of our servers started experiencing severe slowdowns (running 10-100x slower) around 17:05 (time for reference only). It kept working, but very slowly, which led to the rest of the replica set deciding it was dead around 17:29.
I went through the log, and aside from the "long query" warnings (queries that normally take 100-200ms started taking around 8000ms), there was this one:
Sun Jan 15 17:10:23 [initandlisten] connection accepted from 10.2.1.200:36130 #169664
...
Sun Jan 15 17:10:24 [conn169664] Assertion failure cc->_pinValue < 100 db/clientcursor.h 309
0x57a926 0x5857db 0x7982e9 0x882846 0x889e99 0xaa00c6 0x635bb7 0x7fa26256c9ca 0x7fa261b1b70d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a926]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5857db]
/usr/bin/mongod() [0x7982e9]
/usr/bin/mongod(_ZN5mongo19receivedKillCursorsERNS_7MessageE+0xb6) [0x882846]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x359) [0x889e99]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa00c6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x635bb7]
/lib/libpthread.so.0(+0x69ca) [0x7fa26256c9ca]
/lib/libc.so.6(clone+0x6d) [0x7fa261b1b70d]
...

Right around the time the problem started we saw these (there was NO mention of "killcursors" anywhere earlier in the log):
Sun Jan 15 17:05:45 [conn169305] killcursors 64ms
Sun Jan 15 17:05:45 [conn169296] killcursors 64ms
Sun Jan 15 17:05:45 [conn169309] killcursors 60ms

which became a lot worse later:
Sun Jan 15 17:28:42 [conn171423] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171404] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171396] killcursors 2396ms
Sun Jan 15 17:28:42 [conn171428] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171398] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171459] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171436] killcursors 2392ms
Sun Jan 15 17:28:42 [conn171334] killcursors 2492ms

Maybe it's our application mismanaging cursors?
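
For what it's worth, here is a minimal sketch of the kind of cursor mismanagement I have in mind. This is hypothetical illustration code, not our application, and it is written against pymongo for brevity rather than the PHP 1.2.6 driver we actually use; the collection name and query are placeholders:

# A cursor that is opened and then abandoned before being exhausted stays
# alive on the server until the driver later reaps it and sends killCursors,
# which mongod logs as "killcursors".
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["drive"]["entries2"]

def peek_first_entry_leaky(query):
    cursor = coll.find(query).batch_size(101)
    return next(cursor, None)
    # cursor goes out of scope unexhausted and unclosed; the server-side
    # cursor is only cleaned up later via killCursors

def peek_first_entry_clean(query):
    cursor = coll.find(query).batch_size(101)
    try:
        return next(cursor, None)
    finally:
        cursor.close()  # release the server-side cursor right away

If our usage pattern looks like the first function, a burst of abandoned cursors would explain killcursors suddenly showing up in the log.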

At first I assumed some kind of hard drive failure was behind this, but simply restarting the server fixed the problem completely, so I don't think it's a hardware issue.
The database is 99% read-only; the load is light to medium.

My guess is that either some internal structures got corrupted, or there is a problem with the PHP driver or with our pattern of using it.

I checked dmesg, kern.log, etc. – no signs of anything going wrong in the hardware or the kernel.
Also, I checked the historical atop data: between 16:30-17:00 (before the problem) the server did 48120KiB of reads and 5936KiB of writes using around 7% CPU, and between 17:00-17:30 (the problematic interval) it did 175.2MiB of reads and 142.2MiB of writes using around 10% CPU (roughly 4x the reads and 25x the writes; see the quick calculation below) – so it looks like it did a lot of unexpected I/O that our application didn't request (TCP I/O grew by at most about 30%). Swap stayed mostly unused, and the file cache size was unchanged.
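
For reference, a quick sanity check of those atop figures (unit conversion only, using the numbers quoted above):

# ratio of I/O in the problematic interval vs. the preceding one
reads_before_kib, writes_before_kib = 48120, 5936      # 16:30-17:00
reads_after_kib = 175.2 * 1024                         # 17:00-17:30, 175.2 MiB
writes_after_kib = 142.2 * 1024                        # 142.2 MiB

print(round(reads_after_kib / reads_before_kib, 1))    # ~3.7x more reads
print(round(writes_after_kib / writes_before_kib, 1))  # ~24.5x more writes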

This problem recurs without any visible cause; please advise.



 Comments   
Comment by Eliot Horowitz (Inactive) [ 15/Jan/12 ]

There are other cases related to this assert (see the label), so I'm going to close this.

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

Additional details: we had this problem all along, but the VM provider allows burst IOPS up to 8x higher for short periods of time. When the load came, the burst allowance was gone and we were stuck with consumer-HDD-grade speeds for all our LVM volumes.

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

Sorry for the false alarm. Working with our VM provider's people confirmed that there is a hidden binding of IOPS slots to memory slots, so having low memory on the machines (we set 2GB RAM) led to I/O being limited to 300 IOPS per device, which thrashed everything. Sorry again; it appears you can close the case (although that assert might still need some attention).

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

It appears the problem was limited memory on the VM. Increasing the amount of RAM and the IOPS slots (we have pay-as-you-use virtualization billing) got rid of the problems for now; I will report progress later.

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

The PHP driver is 1.2.6.

Comment by Eliot Horowitz (Inactive) [ 15/Jan/12 ]

What version of the php driver?

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

Query execution time skyrockets after the problem triggers; check these (the longer it goes on, the worse it gets):
Sun Jan 15 18:16:02 [conn914] query drive.entries2 nscanned:7723 nreturned:101 reslen:95203 886150ms
Sun Jan 15 18:16:02 [conn1063] query drive.entries2 nscanned:7723 nreturned:101 reslen:95203 808726ms
Sun Jan 15 18:16:02 [conn1436] query drive.entries2 nscanned:7723 nreturned:101 reslen:95203 648782ms

Comment by Aristarkh Zagorodnikov [ 15/Jan/12 ]

I would like to add that this runs in a Xen virtual environment, using kernel 2.6.32.36.
