[DOCS-1064] Not clear what state a very stale replica node will exhibit Created: 29/Jan/13 Updated: 30/Oct/23 Resolved: 15/Apr/13 |
|
| Status: | Closed |
| Project: | Documentation |
| Component/s: | manual |
| Affects Version/s: | mongodb-2.2 |
| Fix Version/s: | Server_Docs_20231030 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Shaun Crampton | Assignee: | Bob Grabar |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | |||
| Issue Links: | |
| Participants: | |
| Days since reply: | 11 years, 3 weeks, 1 day ago | ||||||||
| Description |
|
The manual says that if a mongod node gets too far behind and the oplog wraps, then it won't be able to catch up and will be stuck in that state without manual intervention. I'm trying to write a script to detect that state and alarm on it for my cluster, but it's not clear how to detect it. The link above gives a list of the possible replica set member states (as reported by rs.status().members[n].state), but it's not clear how to tell apart "stale but able to catch up" from "very stale, cannot catch up". Are those states distinguished? If so, what do they map to, or do I need to look elsewhere? |
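For illustration, a minimal pymongo sketch of reading the per-member state that rs.status() exposes; replSetGetStatus is the server command behind that shell helper. The connection string below is a placeholder, not a value from this ticket.

```python
# Minimal sketch: list each replica set member's state via replSetGetStatus.
# The host name below is a placeholder for one of your set's members.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")

# replSetGetStatus is the command behind the rs.status() shell helper.
status = client.admin.command("replSetGetStatus")

for member in status["members"]:
    # state is a numeric code (1 = PRIMARY, 2 = SECONDARY, 3 = RECOVERING,
    # 9 = ROLLBACK, ...); stateStr is the human-readable name.
    print(member["name"], member["state"], member["stateStr"])
```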
| Comments |
| Comment by Shaun Crampton [ 29/Jan/13 ] |
|
Thanks, that's very clear now. It's easy to understand how hard it is to handle the case where the replication simply isn't fast enough. Thankfully, I don't think that will be an issue in my application. |
| Comment by Sam Kleinman (Inactive) [ 29/Jan/13 ] |
|
If the instance can't catch up and must enter initial sync, that operation happens automatically. You may be able to monitor replication lag using the value of optimeDate (http://docs.mongodb.org/manual/reference/replica-status/#replSetGetStatus.members.optimeDate) from all members of the set. To determine the "length" of the oplog in time, you'll need to know the average size of each oplog entry and the frequency of operations, as well as the size of the oplog on all machines that may become primary.

In many situations it's difficult to predict what will happen with regard to replication lag. Under some loads, a replica could fall behind the state of the primary by an hour (say) during a large bulk operation and then reliably catch up. For other deployments, falling behind by more than 15 minutes might be unrecoverable via normal replication (though the conditions required to reproduce this are probably pathological).

During the rollback operation (which may take several moments), the members.state value (http://docs.mongodb.org/manual/reference/replica-status/#replSetGetStatus.members.state) will be "9" for rollback.

While we will improve the documentation on this point, you should also be aware that MMS, which is a free service, monitors replication lag and can issue generic alerts for the kinds of events you need to know about in this case. |
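A hedged sketch of the kind of check described above, assuming direct access to the current primary: it estimates each member's lag from optimeDate and flags any member in rollback (state 9). The host name and the 15-minute threshold are assumptions, not values from this ticket.

```python
# Sketch: estimate replication lag from optimeDate and flag ROLLBACK members.
from pymongo import MongoClient

client = MongoClient("mongodb://primary.example.net:27017")  # placeholder host
status = client.admin.command("replSetGetStatus")

# Use the current primary's optimeDate as the reference point.
primary = next(m for m in status["members"] if m["state"] == 1)

for m in status["members"]:
    # optimeDate is a datetime, so the difference is a timedelta.
    lag_seconds = (primary["optimeDate"] - m["optimeDate"]).total_seconds()
    if m["state"] == 9:
        print("ALERT: %s is in ROLLBACK" % m["name"])
    elif lag_seconds > 15 * 60:  # assumed threshold; tune to your oplog window
        print("WARN: %s is %.0f seconds behind the primary" % (m["name"], lag_seconds))
```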
| Comment by Shaun Crampton [ 29/Jan/13 ] |
|
Thanks for the response, that clarifies the behavior somewhat. The thing I'm still puzzling over is the best way to recognize the bad states you mention. If the instance can't catch up because its most recent oplog entry isn't in the primary's oplog, does the instance indicate an error? Is there some other way to tell that an instance is stuck in that state (e.g. can I somehow look at the oplog times for the different instances and compare them to work out that the instance will never catch up)? I think the rollback case is easier to spot because I can check for a rollback log. |
| Comment by Sam Kleinman (Inactive) [ 29/Jan/13 ] |
|
We'll make a note to clarify the language here. The problem is less cut and dried than the manual suggests: the oplog is a fixed size that depends on the amount of free disk space when the member was added to the replica set (or when the set was initiated, for the first member). Replication is asynchronous, and a member can stop fetching operations from the oplog and then resume replicating normally, as long as the most recent entry in the member's oplog still exists in the primary's oplog.

If the member's most recent oplog entry is not in the primary's oplog, then there's no way for the normal replication process to ensure that all members have an identical data set. In that case we say that the member can't catch up and must run the "initial sync" routine, which allows it to copy the data state from a known up-to-date member of the set and then apply the oplog entries collected during the copy. The result is a member with a data set equivalent to that of all other members. As the name implies, "initial sync" is the process replica set members use to synchronize when they are first added to the set.

If initial sync cannot complete within the "oplog window" (that is, if the first oplog entry copied during the sync is no longer present in the primary's oplog when the copy finishes and the member begins applying oplog entries), then the replica set member will be stuck in a sort of endless loop, unable to obtain a data set equivalent to the other members of the set. To remedy this you will need to reduce the rate of operations on the primary during the initial sync, or find a way to sync the member more quickly (i.e. copying data files from a snapshot of a working secondary, etc.).

The only part of this process that may systematically require manual intervention is when a member of a replica set enters "rollback" state, which happens when a member that was previously primary is disconnected from the set. If the old primary accepts write operations before it realizes that its secondaries are no longer connected, and the secondaries elect a new primary, then there are a number of operations on the old primary that aren't on the new primary (and therefore not on the rest of the set). In this situation the old primary will "undo" or "rollback" the operations it has that are not present in the new primary's data set. The old primary writes these documents out to .bson files in the dbpath, and you can inspect and save (or discard) the data as needed.

Does this help? |
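As a rough illustration of the comparison described above (whether a member's last applied operation is still inside the primary's oplog window), here is a hedged pymongo sketch. The host names are placeholders, and it assumes you can read local.oplog.rs directly on both the primary and the lagging secondary.

```python
# Sketch: check whether a secondary's newest applied oplog entry is still
# present in (i.e. newer than the oldest entry of) the primary's oplog.
from pymongo import MongoClient

primary = MongoClient("mongodb://primary.example.net:27017")      # placeholder
secondary = MongoClient("mongodb://secondary.example.net:27017",  # placeholder
                        readPreference="secondaryPreferred")

def newest_ts(client):
    # Newest oplog entry, found by scanning in reverse natural (insertion) order.
    return next(client.local["oplog.rs"].find().sort("$natural", -1).limit(1))["ts"]

def oldest_ts(client):
    # Oldest entry still held in the capped oplog collection.
    return next(client.local["oplog.rs"].find().sort("$natural", 1).limit(1))["ts"]

last_applied = newest_ts(secondary)     # BSON Timestamp
oldest_on_primary = oldest_ts(primary)  # BSON Timestamp

# Compare the seconds component of the two BSON Timestamps.
if last_applied.time < oldest_on_primary.time:
    print("Secondary has fallen off the primary's oplog; a resync is required.")
else:
    print("Secondary's last applied op is still within the primary's oplog window.")
```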