[SERVER-8295] Performance problems when hitting slaveDelay limit after upgrade to 2.2.2 Created: 23/Jan/13  Updated: 11/Jul/16  Resolved: 29/Jan/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Javi Martin Assignee: Gregor Macadam
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Participants:

 Description   

MMS
https://mms.10gen.com/host/list/4e89bc12ae6429bfa40fbc21

We are experiencing some performance problems on our delayed replica servers since we upgraded them to 2.2.2.

We have one big DB sharded into two partitions, each containing one primary, one normal secondary and one delayed secondary. Each of these mongod instances runs on a dedicated server. The delayed replica has a "slaveDelay" value of 345600 seconds (4 days) and "buildIndexes: false". This scenario was working fine with our old 2.0.0 installation.
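
For reference, a minimal mongo shell sketch of how a member like this is added (run on the primary; the _id and host values are taken from the rs.conf() output shown later in this ticket, and this is only an illustration, not the reporter's actual command):

// add the delayed, non-index-building secondary (illustrative only)
rs.add({
    _id: 4,
    host: "senni:27017",
    priority: 0,            // slaveDelay requires the member to be non-electable
    slaveDelay: 345600,     // seconds (4 days)
    buildIndexes: false     // can only be set when the member is first added
})
rs.conf()                   // verify the new member document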

Last week we upgraded all servers to 2.2.2, following the upgrade instructions. When we finished, we noticed that the load on both delayed servers had increased (from 0.x to 6-7) due to the mongod processes performing disk operations. To solve it we performed a full resync (we deleted all the data in the dbpath and restarted the service).

It seemed to solve the problem, but yesterday, just when one of the servers hit the "slaveDelay" lag, that server's load increased again to 5-6... I saw that the server with the problem has 2 "repl writer worker" threads while the other (still not at the slaveDelay limit) has only one, but I don't know whether this is important. I imagine that when the second server hits the limit (between today and tomorrow) it will have the same problem. I searched the forums but found nothing about how to solve it.

Can anyone tell me how to proceed?

Forgot to mention... I just changed one of the delayed servers to "--nojournal" to see if it changes anything...
So far, no change with the "nojournal" option.



 Comments   
Comment by Javi Martin [ 29/Jan/13 ]

Ok! Thank you for your help.

Comment by Gregor Macadam [ 29/Jan/13 ]

I'm going to close this out - if you would like to open a bug for it then feel free.
Thanks
Gregor

Comment by Gregor Macadam [ 28/Jan/13 ]

Hi
I spoke to Kristina, who owns replication. What is happening when you sync the new node is that the documents are synced over the network to the new node, and then the new node tails the oplog - but it waits until it is slaveDelay behind before applying any operations. That is why, after you have synced the node, there is no activity on that node, and then suddenly, when you reach slaveDelay, you see a saturated disk - because operations have now begun to be replicated onto the secondary.
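
For reference, a quick way to watch this from the mongo shell (just a sketch, assuming a 2.2-era shell; not part of the original exchange):

// run against any member of the set - shows each secondary's syncedTo time and how far behind it is
rs.printSlaveReplicationInfo()

// oplog size and time window of the member you are connected to
db.printReplicationInfo()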

So in terms of opening this as a bug - it is a case of this node not being able to replicate easily because of low resident memory, probably influenced by the low RAM available and nagios. I think that if you weren't using slaveDelay you would see this issue immediately rather than after the slaveDelay time, so I don't think it is necessarily connected to slaveDelay.

If you would like to open this as a bug - as a sort of regression from 2.0.x to 2.2.x where replication is less efficient in low-RAM situations - then I think that would be fine; we can see what the server team has to say about it at least.

You can switch back to 2.0.8.

Thanks
Gregor

Comment by Javi Martin [ 28/Jan/13 ]

Hi Gregor,

We've just hit the slaveDelay limit on senni. As you can see in the MMS graphs, disk IO and system load have increased a lot, just as expected.

Do you want me to run any other test? If not, I'll downgrade to 2.0.8 because we need this server to be OK.

Waiting for your answer.

Regards,

Comment by Javi Martin [ 24/Jan/13 ]

Not now, because it is still below the slaveDelay. The problem will come in about 4 days, just when the replication lag hits the slaveDelay value.

Comment by Gregor Macadam [ 24/Jan/13 ]

iowait looks OK on MMS and the last iostat you posted didn't show a saturated disk - are you still seeing a performance problem on your end?

Comment by Javi Martin [ 24/Jan/13 ]

No, it does have the slaveDelay value configured:

partitionB:SECONDARY> rs.conf()
{
    "_id" : "partitionB",
    "version" : 32,
    "members" : [
        { "_id" : 0, "host" : "flash:27017" },
        { "_id" : 3, "host" : "arden:27017" },
        { "_id" : 4, "host" : "senni:27017", "priority" : 0, "slaveDelay" : 345600, "buildIndexes" : false }
    ]
}

Comment by Gregor Macadam [ 24/Jan/13 ]

Are you running senni without slaveDelay here?

Comment by Javi Martin [ 24/Jan/13 ]

Done! We now have the latest agent version, so you should have iostat graphs in MMS.

By the way, the server went into state "SECONDARY" a few minutes ago:

Thu Jan 24 13:38:01 [rsSync] replSet SECONDARY
Thu Jan 24 13:38:01 [rsSync] replSet slavedelay sleep long time: 345600

A random part of iostat:

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     2,00    0,00    2,00     0,00    32,00    16,00     0,02    8,50   8,50   1,70

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00    36,00    0,00    2,00     0,00   304,00   152,00     0,01    3,00   1,50   0,30

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     7,00    0,00    5,00     0,00    96,00    19,20     0,03    6,20   6,20   3,10

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     2,00    0,00    8,00     0,00    80,00    10,00     0,01    0,62   0,25   0,20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     7,00    0,00    4,00     0,00    88,00    22,00     0,03    7,00   6,50   2,60

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

I hope this helps.

Regards,

Comment by Gregor Macadam [ 24/Jan/13 ]

Could you upgrade the MMS agent on senni also? It's version 1.3.7 and the latest is 1.5.3 - I think this is why iostat data is not showing in MMS.

Comment by Gregor Macadam [ 24/Jan/13 ]

OK thanks

Comment by Javi Martin [ 24/Jan/13 ]

I moved the dbpath to partitionB.old and started mongod, so it's doing the initial sync ("STARTUP2" state). Now, the iostat output is as follows (1-second samples):

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00  5343,00    0,00  135,00     0,00 132784,00   983,59   134,87   638,53   7,41 100,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00  109,00     0,00 101936,00   935,19    34,48  1062,01   7,21  78,60

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00   152,00    0,00  215,00     0,00  14040,00    65,30     5,73    26,64   1,37  29,50

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00 12439,00    0,00  103,00     0,00  91920,00   892,43    40,06   271,17   7,01  72,20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00   37,00     0,00  32936,00   890,16     2,11   384,86   7,89  29,20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00   24,00     0,00  16448,00   685,33     0,22     9,25   6,38  15,30

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00 14449,00    0,00  126,00     0,00 115432,00   916,13    49,80   306,00   7,02  88,50

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     9,00    0,00   42,00     0,00  29544,00   703,43     0,97   290,86   6,86  28,80

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00 13361,00    0,00   49,00     0,00  39608,00   808,33    15,73    42,18   6,33  31,00

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00  102,00     0,00  95568,00   936,94    27,84   406,81   6,98  71,20

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00   23,00     0,00  16352,00   710,96     0,19     7,74   6,30  14,50

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s    wsec/s avgrq-sz avgqu-sz    await  svctm  %util
sdb               0,00     0,00    0,00   21,00     0,00  13136,00   625,52     0,21    10,52   8,48  17,80
 

When it finishes the initial sync and becomes "SECONDARY", I'll send you the iostat output again.

Comment by Gregor Macadam [ 24/Jan/13 ]

Do you have iostat data from before the secondary hits the slaveDelay that you could share?

Comment by Javi Martin [ 24/Jan/13 ]

Hi,

I tried it but with no luck...

/extra/mongo/bin/mongod --dbpath /extra/mongo/partitionB --port 27017 \
    --logpath /extra/mongo/logs/partitionB.log --fork --rest \
    --pidfilepath /extra/mongo/var/pid_partitionB --oplogSize 102400 \
    --shardsvr --replSet partitionB --replIndexPrefetch none --journal

iostat /dev/sdb 1 1000 -x
 
avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.74    0.00    0.99   79.65    0.00   18.61

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  158.00    0.00 25960.00     0.00   164.30    18.75  120.66   6.33 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00    0.50   85.46    0.00   13.78

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  140.00    0.00 27296.00     0.00   194.97    10.95   80.34   7.14 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.46    0.00    1.49   91.58    0.00    2.48

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  140.00    0.00 27096.00     0.00   193.54    12.44   88.44   7.14 100.00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.71    0.00    2.99   80.85    0.00    8.46

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdb               0.00     0.00  145.00    0.00 26296.00     0.00   181.35    15.75  101.34   6.90 100.10
 
 
Going back to 2.0.8 solves the problem...

Comment by Gregor Macadam [ 24/Jan/13 ]

Hi
Can you try to switch replIndexPrefetch to "none" on senni and see if this helps please?
Thanks
Gregor

Comment by Gregor Macadam [ 23/Jan/13 ]

I would say downgrade that node to 2.0.8 - don't go back to 2.0.0.
It will be fine to run like that though - that node can never become primary anyway.

Comment by Javi Martin [ 23/Jan/13 ]

I'll try to add more RAM to see what happens, but for the moment I'll downgrade both nodes. Will there be any problem with the data? I mean, can a 2.2.0 server get data from a 2.2.2 replica set? Could the fact that this node is also part of the shard affect anything?

Thank you for your help.

Comment by Gregor Macadam [ 23/Jan/13 ]

You can't turn it off. I'm guessing that is the issue. If you can repro it on another host (with more RAM) then we can investigate the possibility that this is something specific to slaveDelay.
Yes, you could downgrade that node to 2.0.x if that solves the problem.

Comment by Javi Martin [ 23/Jan/13 ]

OK. And can I do something to avoid the current problem? I mean, is there a way to "turn off" these parallel replication ops or something like that? Would it be possible to keep just the delayed servers on an old version? I mean, downgrade only those 2 servers to 2.2.0 and leave the rest of the cluster on 2.2.2?

Comment by Gregor Macadam [ 23/Jan/13 ]

When you are doing the initial sync, you are querying the data files on the primary and writing consecutive blocks of disk on the secondary: you are page faulting, but writing an entire page. When you reach slaveDelay, you begin to tail the oplog: you are page faulting but only updating part of the page (one document), then page faulting somewhere else and updating another single document. So your page faulting is a lot less efficient when you are tailing the oplog than when doing the initial sync. This sounds like it is saturating your disk trying to keep up with replication. There are some changes to replication between 2.0.x and 2.2.x - basically parallel replication operations on the secondary - which I assume would explain the difference.
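
A rough way to observe this from the mongo shell while the node is tailing the oplog (just a sketch; these serverStatus fields are from the 2.2-era MMAP-based storage engine):

var s = db.serverStatus()
s.mem.resident                      // resident memory in MB
s.extra_info.page_faults            // cumulative page faults (reported on Linux)
s.backgroundFlushing.average_ms     // average time spent flushing dirty pages to disk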

Comment by Javi Martin [ 23/Jan/13 ]

... and why does it perform OK when the slaveDelay is not hit?

Comment by Javi Martin [ 23/Jan/13 ]

Yes. The other host has the same amount of RAM, although it is less used (no nagios). But, in fact, I would expect the same behaviour as with version 2.2.0... and why the 100% disk usage?

Comment by Gregor Macadam [ 23/Jan/13 ]

OK, that explains the low resident memory - running the OS, nagios and mongodb on 2GB is not going to give you decent performance.
Which other host are you seeing this problem with - is it also low on RAM?

Comment by Javi Martin [ 23/Jan/13 ]

Mongo datafiles are on sdb.

The server has 2 GB of memory installed and about 1 GB free.

[mongo@OBE015 (senni) ~]$ free -mt
                     total       used       free     shared    buffers     cached
Mem:                  2005       1353        651          0          8        444
-/+ buffers/cache:                 901       1103
Swap:                 3999        725       3274
Total:                6005       2078       3926

Comment by Gregor Macadam [ 23/Jan/13 ]

It doesn't look that bad - however, assuming the data files are stored on sda, you could lower the readahead:

sudo blockdev --setra 64 /dev/sda

and restart mongod.
However, since those settings don't look terrible, it could be that nagios is using a lot of RAM - how much RAM is on that box?

Comment by Javi Martin [ 23/Jan/13 ]

The blockdev report:

[root@OBE015 (senni) ~]# blockdev --report
RO    RA   SSZ   BSZ   StartSec            Size   Device
rw   256   512  4096          0   1000204886016   /dev/sda
rw   256   512  1024       2048       209715200   /dev/sda1
rw   256   512  4096     411648      4194304000   /dev/sda2
rw   256   512  4096    8603648      2097152000   /dev/sda3
rw   256   512  1024   12699648            1024   /dev/sda4
rw   256   512  4096   12701696    993700872192   /dev/sda5
rw   256   512  4096          0   1000204886016   /dev/sdb
rw   256   512  4096         63   1000202241024   /dev/sdb1

Comment by Gregor Macadam [ 23/Jan/13 ]

Resident memory on senni hasn't changed in the last several months, so I don't think it is connected, but let's get it fixed anyway.
Can you run a

sudo blockdev --report

on that host?

Comment by Gregor Macadam [ 23/Jan/13 ]

There is nothing strange about this lag, because senni is configured with a 4-day delay (just to have a backup to prevent accidental deletions).

The other servers can connect to senni with no problem. In fact, in senni's logs I can see the connections:

Wed Jan 23 11:22:16 [initandlisten] connection accepted from X.X.X.144:54034 #4621 (4 connections now open)
Wed Jan 23 11:22:46 [conn4620] end connection X.X.X.147:53738 (3 connections now open)
Wed Jan 23 11:22:46 [initandlisten] connection accepted from X.X.X.147:53742 #4622 (4 connections now open)
Wed Jan 23 11:22:46 [conn4621] end connection X.X.X.144:54034 (3 connections now open)

The connection errors that you see appear because I stopped the mongod processes two hours ago, since the load was too high and disk IO was at 100%. After stopping the mongod processes everything went back to normal; now, disk IO and load are about 0.
By the way, when I stopped the processes I got this error in the logs on BOTH delayed servers:

Wed Jan 23 11:26:01 [DataFileSync] flushing mmaps took 25678ms for 114 files
Wed Jan 23 11:26:03 got signal 15 (Terminated), will terminate after current cmd ends
Wed Jan 23 11:26:04 [repl writer worker 1] ERROR: writer worker caught exception: interrupted at shutdown on: { ts: Timestamp 1358590987000|27, h: -8072858002892174337, v: 2, op: "d", ns: "xxxx.yy", b: true, o: { _id: ... } }
Wed Jan 23 11:26:04 [repl writer worker 1] Fatal Assertion 16360
0xaffd31 0xac5323 0x9a2e56 0xad35cd 0xb45ba9 0x3e1e4077f1 0x3e1dce5ccd
/extra/mongo/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
/extra/mongo/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xac5323]
/extra/mongo/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9a2e56]
/extra/mongo/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xad35cd]
/extra/mongo/bin/mongod() [0xb45ba9]
/lib64/libpthread.so.0() [0x3e1e4077f1]
/lib64/libc.so.6(clone+0x6d) [0x3e1dce5ccd]
Wed Jan 23 11:26:04 [repl writer worker 1]

***aborting after fassert() failure

Wed Jan 23 11:26:04 Got signal: 6 (Aborted).

Wed Jan 23 11:26:04 Backtrace:
0xaffd31 0x558bb9 0x3e1dc32900 0x3e1dc32885 0x3e1dc34065 0xac535e 0x9a2e56 0xad35cd 0xb45ba9 0x3e1e4077f1 0x3e1dce5ccd
/extra/mongo/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaffd31]
/extra/mongo/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x558bb9]
/lib64/libc.so.6() [0x3e1dc32900]
/lib64/libc.so.6(gsignal+0x35) [0x3e1dc32885]
/lib64/libc.so.6(abort+0x175) [0x3e1dc34065]
/extra/mongo/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xac535e]
/extra/mongo/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x156) [0x9a2e56]
/extra/mongo/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0xad35cd]
/extra/mongo/bin/mongod() [0xb45ba9]
/lib64/libpthread.so.0() [0x3e1e4077f1]
/lib64/libc.so.6(clone+0x6d) [0x3e1dce5ccd]

I replaced the ns name with "xxxx.yy".

Maybe this error is related to the problem we are having?
What is more, now, when starting the mongod server, it stays in state "STARTUP2". In the logs, we get:

Wed Jan 23 13:22:17 [conn74] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:config.settings query:{}
Wed Jan 23 13:22:17 [conn74] problem detected during query over config.settings : { $err: ... }

The rs.config() from flash (the primary) is as follows:

PRIMARY> rs.config()
{
    "_id" : "partitionB",
    "version" : 32,
    "members" : [
        { "_id" : ... },
        { "_id" : ... },
        { "_id" : ... }
    ]
}

I hope we can find a solution, because right now I have both servers down. The only solution at this point is performing a full resync (delete the dbpath and start the service again), but I'll wait for your answer before doing it, in case you need more details.
