[SERVER-10474] How does replication work and what are the performance bottlenecks? Created: 09/Aug/13  Updated: 10/Dec/14  Resolved: 19/Aug/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.3
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Johnny Boy Assignee: Stennie Steneker (Inactive)
Resolution: Done Votes: 0
Labels: performance, replicaset
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu LTS


Participants:

 Description   

We've been having issues with replication lag.
We had to manually restart one of our two secondaries, run db.adminCommand({replSetMaintenance: true}), let it sync up, and then do the same for the other secondary. Only then were both able to catch up with the oplog.

Why does restarting MongoDB let a secondary start catching up when it is 1800+ seconds behind? Setting maintenance mode alone did not help.

What are the bottlenecks of replication?
Do all "repl writer workers" have to wait for the one writer thread to get the oplog replication done?
The server did not seem to use much of the available resources while replication was stalled.

Does the maximum capacity to replicate one database depend on how much one CPU core can handle? How do we monitor this limitation?



 Comments   
Comment by Stennie Steneker (Inactive) [ 19/Aug/13 ]

Hi Johnny,

As we haven't heard from you in a while, I'm going to close this issue.

If you have further support questions, community forums such as mongodb-user and Stack Overflow are a better starting point.

Thanks,
Stephen

Comment by Stennie Steneker (Inactive) [ 13/Aug/13 ]

Hi Johnny,

You should not need to set maintenance mode or restart the secondary in order for replication to continue successfully. Are you by chance running any scripts or commands that kill long-running operations on the server?

Can you open a ticket in the SUPPORT (Community Private) project and attach your mongod logs? Please reference SERVER-10474 in the description as well. Information in the SUPPORT project is private and only shared with the 10gen team.

For more details on replication mechanics, I would suggest reviewing the Replication documentation in the manual as well as:

Regards,
Stephen

Comment by Johnny Boy [ 12/Aug/13 ]

Hello!

Thank you for the summary. I will use the user group next time.

Yes, those are the hosts in the MMS group. The pattern is that during deploys or peak traffic, when the most writes are being performed, the replication lag keeps growing and the secondaries have a hard time catching up without manual intervention.
It seemed strange that there was no progress in the number of seconds behind even after setting maintenance mode. Only after a restart did it finally progress.

Is there anything in particular I should look for in the logs?
Do you need the queries from the log, or can I strip those out? There might be somewhat private data in there. If you need the whole log, can I ask for a PGP key of yours?

As for the technical details of how replication works, is there any detailed documentation I can read?

Thank you!

Comment by Stennie Steneker (Inactive) [ 12/Aug/13 ]

Hi Johnny,

The SERVER project is for reporting bugs or feature suggestions for the MongoDB server.

For MongoDB-related support discussion you should post on the mongodb-user group (http://groups.google.com/group/mongodb-user) or Stack Overflow. A question like this, which involves more discussion, would be best posted on the mongodb-user group.

With regard to replication lag, you would need to be more specific about what you are seeing. For example, is this sustained replication lag or just apparent short jumps in replication lag?
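
One way to tell sustained lag from short jumps is to sample the lag of each secondary at intervals. The following is a minimal sketch, not an official tool. It assumes a status document shaped like the output of the replSetGetStatus command (rs.status() in the shell), where each member has name, stateStr, and optimeDate fields. The function name replicationLagSeconds and the sample hostnames are hypothetical.

```javascript
// Sketch: per-secondary replication lag in seconds, computed from a document
// shaped like replSetGetStatus output. The function name is hypothetical.
function replicationLagSeconds(status) {
  var primary = null;
  var i;
  // Find the member currently reporting itself as primary.
  for (i = 0; i < status.members.length; i++) {
    if (status.members[i].stateStr === "PRIMARY") {
      primary = status.members[i];
    }
  }
  if (primary === null) {
    return null; // no primary visible, so lag cannot be computed
  }
  // Lag is the difference between the primary's and the secondary's
  // last applied oplog entry times.
  var lag = {};
  for (i = 0; i < status.members.length; i++) {
    var m = status.members[i];
    if (m.stateStr === "SECONDARY") {
      lag[m.name] = (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
    }
  }
  return lag;
}

// Hypothetical sample data; real values would come from rs.status().
var sampleStatus = {
  members: [
    { name: "db1.example.net:27017", stateStr: "PRIMARY", optimeDate: new Date("2013-08-09T12:30:00Z") },
    { name: "db2.example.net:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-08-09T12:00:00Z") },
    { name: "db3.example.net:27017", stateStr: "SECONDARY", optimeDate: new Date("2013-08-09T12:29:58Z") }
  ]
};
var lagBySecondary = replicationLagSeconds(sampleStatus);
```

In the mongo shell this could be called as replicationLagSeconds(rs.status()). Sampling it periodically shows whether the lag keeps growing or is only a brief jump. Members in other states (for example RECOVERING, which is what a member in maintenance mode reports) are skipped in this sketch.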

Assuming you are referring to hosts in the MMS group linked to your user account, it looks like you have had only one brief bump in replication lag over the past week rather than a sustained problem. Without seeing the full logs it is hard to know what else was happening at the time, but I would suspect that a resource issue such as networking or I/O contention could have affected your replication.

If you would like to attach your logs here for review we may be able to provide more insight. I would note that issues and attachments in the SERVER project are publicly visible though.

Regards,
Stephen
