[SERVER-7810] Replication lag when updates handled by database. Created: 30/Nov/12  Updated: 11/Jul/16  Resolved: 22/Jan/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Azat Khuzhin Assignee: David Hows
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

version: 2.2.2-rc1
git version: nogitversion


Attachments (PNG): buffers-memory.png, cached-memory.png, complex-load.png, connections-more-long-period.png, connections.png, database-operations.png, free-memory.png, hdd.png, memory-footprints-2.png, memory-footprints.png, operations-counters.png, page-faults-minute.png, replication-lag.png, total-background-flus-time-ms-in-last-1-minute.png
Participants:

 Description   

I have the following configuration:
server1 - (mongos, mongod cfg, mongod --replSet first, application server[write] )
server2 - (mongod arbiter, mongod --replSet first)
server3 - (mongod cfg, mongod arbiter, mongod --replSet second)
server4 - (mongos, mongod cfg, mongod --replSet second, application server[read] )

All insert/update requests go to the application server on server1.
All find requests go to the application server on server4.

There were about 20 concurrent updates. (The collection receiving the updates has 819 366 177 documents and is capped; the updates do not change document size.)
The updates used indexes for the query condition.

But I see the following interesting behavior:

  • If I set server2 as master of replica set first, it handles updates normally, but server1 develops replication lag that grows to as much as 1 hour.
  • Then I stop sending updates for a while, and once the replication lag is back to zero, I set server1 as master.
  • server1 also handles updates normally, but now server2 has replication lag, and it keeps growing.

Why can't the SECONDARY keep up with operations the way the PRIMARY does in this case?
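One way to put a number on the lag described above is to compare each secondary's last applied operation time with the primary's. The sketch below assumes an object shaped like the output of rs.status(), using only members[].name, members[].stateStr and members[].optimeDate; the host names, ports and timestamps in the example are hypothetical.

```javascript
// Sketch: per-secondary replication lag from an rs.status()-shaped object.
// Assumption: each data-bearing member has name, stateStr and optimeDate,
// where optimeDate is the time of the last operation that member applied.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY in replica set status");
  }
  const primaryTime = new Date(primary.optimeDate).getTime();
  return status.members
    .filter((m) => m.stateStr === "SECONDARY")
    .map((m) => ({
      name: m.name,
      // how far this secondary's last applied op trails the primary's, in seconds
      lagSeconds: (primaryTime - new Date(m.optimeDate).getTime()) / 1000,
    }));
}

// Hypothetical status: server2 is PRIMARY and server1 trails it by one hour.
const exampleStatus = {
  set: "first",
  members: [
    { name: "server1:27017", stateStr: "SECONDARY", optimeDate: "2012-11-30T10:00:00Z" },
    { name: "server2:27017", stateStr: "PRIMARY", optimeDate: "2012-11-30T11:00:00Z" },
  ],
};

console.log(replicationLagSeconds(exampleStatus));
```

Assuming a shell that supports arrow functions (for example mongosh), the same function could be called as replicationLagSeconds(rs.status()); the built-in rs.printSlaveReplicationInfo() helper reports similar information.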



 Comments   
Comment by Azat Khuzhin [ 13/Jan/13 ]

Hi David,

It seems not.
You can close it.

Thanks.

Comment by David Hows [ 28/Dec/12 ]

Hi Azat,

Is there anything further we can do with this issue?

Or can we close it?

Cheers,

David

Comment by Azat Khuzhin [ 17/Dec/12 ]

> The memory-footprints.png output would have been the graph I would use for this, but it doesn't contain any data about cache or buffered memory (as free does). Additionally the memory footprints numbers are extraordinarily low (1M of virtual memory max) and I wanted to confirm these with top.

Sorry, the units on that graph are wrong.
I attached a new graph (memory-footprints-2.png) whose values are in GB instead of KB.

For cached/buffers/free memory, see the new attachments.
Total memory: 48 GiB

Yes, I also think that IO is the main problem.
"complex-load.png" shows that iowait is high.

I considered MMS, but for now our internal monitoring system is enough.
I understand that MMS would be more convenient for you, but I can't install it at the moment.

Thanks for help.

Comment by David Hows [ 17/Dec/12 ]

Hi Azat,

The memory-footprints.png output would have been the graph I would use for this, but it doesn't contain any data about cache or buffered memory (as free does). Additionally the memory footprints numbers are extraordinarily low (1M of virtual memory max) and I wanted to confirm these with top.

The page fault numbers are also very high, averaging 5000 per minute according to your graph, and the background flush times are higher than I would expect. This leads me to believe you have memory usage problems and potential IO problems in your environment.
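Both figures are rates derived from cumulative counters: 5000 page faults per minute is roughly 83 per second. The sketch below assumes two serverStatus() snapshots taken a known number of seconds apart and the MMAPv1-era fields extra_info.page_faults (Linux) and backgroundFlushing.flushes / backgroundFlushing.total_ms; newer server versions may not report backgroundFlushing. The snapshot numbers are invented for illustration.

```javascript
// Sketch: per-minute rates from two serverStatus()-shaped snapshots.
// Assumption: extra_info.page_faults, backgroundFlushing.flushes and
// backgroundFlushing.total_ms are cumulative counters since mongod started.
function perMinuteRates(before, after, intervalSeconds) {
  const perMinute = 60 / intervalSeconds;
  const faults = after.extra_info.page_faults - before.extra_info.page_faults;
  const flushMs = after.backgroundFlushing.total_ms - before.backgroundFlushing.total_ms;
  const flushes = after.backgroundFlushing.flushes - before.backgroundFlushing.flushes;
  return {
    pageFaultsPerMinute: faults * perMinute,
    flushMsPerMinute: flushMs * perMinute,
    // average duration of the flushes that completed during the interval
    avgFlushMs: flushes > 0 ? flushMs / flushes : 0,
  };
}

// Hypothetical snapshots taken 60 seconds apart.
const t0 = { extra_info: { page_faults: 100000 }, backgroundFlushing: { flushes: 10, total_ms: 20000 } };
const t1 = { extra_info: { page_faults: 105000 }, backgroundFlushing: { flushes: 11, total_ms: 27000 } };

console.log(perMinuteRates(t0, t1, 60));
```

Sampling the counters and differencing them is what a monitoring system does to produce graphs such as page-faults-minute.png; the raw counters alone only grow over time.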

Have you considered installing MMS? It's free and available at mms.10gen.com

Cheers,

David

Comment by Azat Khuzhin [ 14/Dec/12 ]

Hi David,

I've already fixed this by using a write concern with "w=2", so it no longer reproduces.

However, I found information about page faults and background flush times. See the attachments.

A sample document:

{
        "_id" : ObjectId("4ffcb7de0f76494937001e00"),
        "key" : "80e14fda39fcf7ce4514ae1dfffad8c6",
        "country_code" : "MY",
        "_domainid" : ObjectId("4ffcb7de0f76494937001e00"),
        "uhash" : 926791063
}

Update:

// update(query, update, upsert = false, multi = true)
db.foo.update({key: KEY, _id: {$lt: ID}}, {$set: {_domainid: NEW_DOMAIN_ID}}, false, true)
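The "w=2" fix mentioned in this comment makes each write wait until a secondary has also applied it, which in effect throttles the writers to the pace the secondary can sustain. As a sketch, here is the update above expressed as an update database command document (the MongoDB 2.6+ command format) carrying that write concern. The builder function, the wtimeout of 5000 ms and the plain-string stand-ins for KEY, ID and NEW_DOMAIN_ID are hypothetical; a real command would use ObjectId values. On 2.2 itself, the equivalent was to follow the update with a getLastError command using w: 2.

```javascript
// Sketch: the shell update above as an `update` command document with w:2.
// Hypothetical helper; key, maxId and newDomainId stand in for KEY, ID and
// NEW_DOMAIN_ID, and wtimeoutMs bounds how long to wait for the secondary.
function buildUpdateCommand(key, maxId, newDomainId, wtimeoutMs) {
  return {
    update: "foo",
    updates: [
      {
        q: { key: key, _id: { $lt: maxId } },
        u: { $set: { _domainid: newDomainId } },
        upsert: false, // third argument of the shell call
        multi: true, // fourth argument of the shell call
      },
    ],
    // acknowledge only after the primary and one secondary have applied it
    writeConcern: { w: 2, wtimeout: wtimeoutMs },
  };
}

const cmd = buildUpdateCommand(
  "80e14fda39fcf7ce4514ae1dfffad8c6",
  "4ffcb7de0f76494937001e00", // hypothetical stand-in for ObjectId(ID)
  "NEW_DOMAIN_ID", // hypothetical placeholder
  5000
);
console.log(JSON.stringify(cmd, null, 2));
```

With real ObjectId values, a document like this could be passed to db.runCommand(cmd) in the shell.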

This may also be useful:

> db.foo.getIndexes()
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "stat.foo",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "key" : 1
                },
                "ns" : "stat.foo",
                "name" : "key_1"
        },
        {
                "v" : 1,
                "key" : {
                        "_domainid" : 1
                },
                "ns" : "stat.foo",
                "name" : "_domainid_1"
        }
]

As for top and free, could you explain what you need from these commands?
Maybe I can get it from our monitoring system.
And isn't it the same as "memory-footprints.png" and "complex-load.png"?

Comment by David Hows [ 11/Dec/12 ]

Hi Azat,

From what I have been able to gather from your graphs you may have issues with your working set or updates.

Are you able to get information about mongod background flush times and pagefaults? Would it be possible for you to start using MMS for 24-48 hours and share your URL?

Can you attach a sample document so I can see your schema, and can you explain what kind of updates you make to your documents?

If possible, I would also like to see data about your mongod instances' memory usage. Could you attach the output of:

top
free -m

Cheers,

David

Comment by Azat Khuzhin [ 10/Dec/12 ]

David, I attached the graphs.

If you have any questions about them, feel free to ask.
These graphs don't look good, so I expect you will have questions.

Comment by David Hows [ 07/Dec/12 ]

Hi Azat,

Would you be able to attach graphs of the following from your stats along with the disk and CPU statistics?

Mongo Connections
MongoDB Database Operations
MongoDB Memory Footprint
MongoDB Ops Counters

Can you also show the two data points that show the hour-long replication lag?

Cheers,

David

Comment by Azat Khuzhin [ 04/Dec/12 ]

The following MongoDB performance stats are available:

Mongo Connections
MongoDB Database Operations
MongoDB Memory Footprint
MongoDB Ops Counters
MongoDB Sum Total of Sizes of All Databases

There are also other metrics, such as memory, disk, and network.

Comment by David Hows [ 04/Dec/12 ]

Hi Azat,

Does your internal monitoring solution collect any of the mongodb performance stats from your instances? If so, which ones do you have available?

We would like to see some of these indicators within your system to determine what is happening when the replication lag occurs.

Cheers,

David

Comment by Azat Khuzhin [ 02/Dec/12 ]

No, we have an internal monitoring system.
If you need information such as iowait, load average, or disk read/write, feel free to ask.

Thanks.

Comment by Eliot Horowitz (Inactive) [ 01/Dec/12 ]

Is the replica set in mms?

Generated at Thu Feb 08 03:15:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.